Gigablast

Type of engine: General web search.

SUMMARY
Relevancy of results: Needs improvement.
Freshness of results: Very good, and not only are the results fresh but the index date is listed next to each result.
Features and functionality: Very straightforward and easy to use.
Quality of help and “about us” pages: The Help sections are OK, but I learned more about using Gigablast from reading articles about it on other sites than from reading its own help sections.
Business model: Selling ancillary search services, such as enterprise search. No sponsored links, banner ads, or any other form of advertising presented to the user.

INTRO
I’ve been wanting to do a review of Gigablast for a while because the story of Gigablast is inspiring. Gigablast was written by former Infoseek engineer Matt Wells. Matt has a blog of sorts, but he has not updated it since February, 2004. The blog mostly covers Gigablast, but also has notes about other non-Gigablast issues. I like his candidness, such as “Alright you SEO people. Get your bots off my search results.”
The thing I really like about Gigablast is that it knows what its business is, namely general web search, and it simply goes about doing that. There are not half a dozen search tabs or other distractions. Although they do offer some other services, as I will mention later, you can go to Gigablast and it is as obvious as night and day what you can do there. You go to Gigablast, you enter a search and that’s that.

Right on the front page they list how many pages have been indexed. As of January 12, 2005 that number is 1,014,363,952. Over the past four days the number did not change, but I know it was not too long ago that the number was half what it is now.

UI & FEATURES and QUERY EXAMPLES
Really the only feature that users see on a regular basis is Giga Bits, and so I have decided to write about features and queries together. Giga Bits are related terms that appear at the top of every search result page. Clicking one appends it to the original search rather than replacing it. Giga Bits can sometimes be used to find answers to natural language queries. For example, for What is the capital of Sweden?, the first Giga Bit result is Stockholm. I happen to know that answer is correct, but if you did not know that you might not pick up on the fact that the Giga Bit term can actually be the answer. So although it is a nice bonus, the way it is implemented right now will not be clear to many users. After clicking on the Giga Bits suggestion of Stockholm, the first result is in Swedish and there is nowhere to set user preferences for things like language.

Giga Bits is a nice idea, but the current implementation is confusing and it took me a while to figure it out. I am still not clear on what the percentages next to each term mean so I tend to ignore them. I think engines do themselves a disservice by naming things with cute and clever names like Giga Bits. How about just calling it something so that users immediately know what they are seeing? Like related terms or refine your query, etc.

As I mentioned earlier, one really cool feature Gigablast has is that it tells you when each page was indexed. Another very nice thing is that there are a few options next to each search result. You can click on older copies to go to the Internet Archive (see my Wayback Machine review) and see archived versions of the site. You can also look at cached copies, and a stripped version that takes out images and just leaves text. These are all very nice.

Gigablast also has something that used to be common, but you don’t see so much anymore. And that is an easy link to other search engines’ results. At the bottom of each search page you will see: Try your search on google yahoo alltheweb dmoz alta vista teoma wisenut.

As a side note, Gigablast used to default to OR searches, but now defaults to AND. Seems wise to me, particularly now that their index size is growing.

Time to try another query, such as How does Gigablast make money?.
The first listing is a meta-search result page with a list of work at home businesses (MLMs), a few of which were pulled from Gigablast. Not so good. I looked through all ten of the results on Gigablast’s first page of results and none of them were relevant. Results 8, 9, and 10 are all the same entry from John Battelle’s blog and the only place the word “Gigablast” appears is in his list of search companies in the menu section of the page.

But the first Giga Bit is intriguing: “arrangement with Google”. So, by using that modifier, we get how does gigablast make money? “arrangement with Google”. The first two results are duplicates, as in the same content and same title, but one is hosted by Resource Shelf and the other by Free Pint. A tough thing for engines to detect, but nonetheless a bad user experience because they offer no differentiating content. Plus, the content is not relevant. The site mentions Gigablast and it mentions Google, and I thought maybe I’d find some juicy bit about the two companies working together, but nope, nothing like that. Aggh, but it keeps getting worse because not only listings 1 and 2, but all of 1 – 7 are actually the same site, and they all have the same title too. The fourth result is a TinyURL which probably should not be indexed in the first place. #7 is a redirect that takes you right to the same site. For #8 the page discusses both Gigablast and Google, but offers no insights into them having any kind of agreement together. Afraid I was led on a wild goose chase.

I also noticed that on the top of the results page it says, “Results 1 to 8 of about 113,” but it is interesting the way they have done this. Because at the bottom of the page it says, “No more results found. Show relevant partial matches for your query.” But when I click on that I get “Results 9 to 18 of about 231.” So now I am a confused user. When it said 8 of 113 I thought that the 8 were AND matches and the remainder were partial matches, however when I clicked on the partial match link I got 231 results. What happened to 113? Where did that go? Not a huge deal, especially since I think other engines get these numbers wrong sometimes too, but it is distracting nonetheless.

For the next Giga Bit refinement, how does gigablast make money? “Matt Wells”, the first result is a Search Engine Watch interview with him from September, 2003, and it has a great answer to my question:

Q. From a business perspective, Gigablast carries no advertising? Is this a decision you plan to keep? How does Gigablast make money?

A: Money is derived from selling search services on my products page. At this point I don’t think I’ll put up advertisements unless I need the revenue to support Gigablast or myself.

Sometimes, the Giga Bit suggestions are pretty random, like how does gigablast make money? “going to make decisions”. There are four results to this strange query. Three of them are the same John Battelle posting and the fourth is from Geeking with Greg Linden. Greg’s posting is indeed about Gigablast, but not about how they make money. There are other posts about other companies like Google and Technorati making money, but not about Gigablast making money.

Some comments about these duplicate and non-relevant results. It seems that Gigablast is indexing and keeping multiple snapshots of the same site. For the above example, the duplicate listing was indexed in September, November and then in December. Although the site may have been updated during those months, the page is the same page. There is only one page. The difference is that the URLs are not being normalized. So here is what two of the three indexed URLs look like:
http://www.battellemedia.com/archives/000627.php
http://battellemedia.com/archives/000627.php

As you can tell, they are nearly identical except that one has WWW. In certain rare cases these could actually be different, but in this case they are the same. The third instance of this site is this:
http://www.snipurl.com/62pj
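
To make the WWW point concrete, here is a minimal Python sketch of the kind of URL normalization that would collapse the first two addresses into one. I have no idea how Gigablast actually handles URLs internally; the function name normalize_url is my own, and the rule that "www.example.com" and "example.com" are the same site is an assumption that, as noted above, will not hold in every case.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a canonical form of a URL for duplicate detection (hypothetical helper)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Assumption: a leading "www." does not change which page is served.
    if host.startswith("www."):
        host = host[4:]
    # Keep an explicit port if the URL had one.
    if parts.port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop the fragment; it never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


a = normalize_url("http://www.battellemedia.com/archives/000627.php")
b = normalize_url("http://battellemedia.com/archives/000627.php")
print(a == b)  # the two Battelle URLs now compare equal
```

A rule like this does nothing for the Snip URL, which has a different host and path entirely. That one can only be matched by following the redirect or by comparing content.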

I think Gigablast could do a couple of things to resolve this situation. If they have the bandwidth they could compare page content and if there is enough overlap, and the URL is nearly identical, they could conclude it is the same page and only show it once, or at least offer it as a cluster. I understand though that site content comparisons are tough and the devil is in the details. The other thing that might work is to compare the display text for the results. For all three of them, the displayed text is exactly the same. That should be a dead giveaway. And the last thing is to not index Tiny URLs and Snip URLs at all. Although these are handy features for capturing long URLs, I think engines should follow them through and only gather the original URL. Plus, in this case, the original URLs are not even that long.
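
The display-text comparison could look something like the Python sketch below. It is only an illustration of the idea, not Gigablast's method: the function names, the three-word shingle size and the 0.9 threshold are all my own assumptions, and the snippets in the usage example are made up.

```python
def shingles(text, size=3):
    """Split text into overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Overlap between two texts: 1.0 means identical shingle sets."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def collapse_duplicates(results, threshold=0.9):
    """Keep the first result of each group whose display text nearly matches.

    results is a list of (url, display_text) pairs in ranked order.
    The threshold is an assumed value, not a tuned one.
    """
    kept = []
    for url, text in results:
        if all(jaccard(text, seen) < threshold for _, seen in kept):
            kept.append((url, text))
    return kept


# Hypothetical display text, invented for this example.
results = [
    ("http://www.battellemedia.com/archives/000627.php", "a post about search companies and their plans"),
    ("http://battellemedia.com/archives/000627.php", "a post about search companies and their plans"),
    ("http://www.snipurl.com/62pj", "a post about search companies and their plans"),
    ("http://example.com/other", "an unrelated page about something else entirely"),
]
print(len(collapse_duplicates(results)))
```

Comparing every result against every kept result is fine for a page of ten listings. Doing the same thing across a billion-page index is a different matter, which is where the devil in the details comes in.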

The interesting thing about this de-duplicating business is that for their XML feeds Gigablast offers the option of de-duplicating results. However, I could not find a way to add the parameter to a regular search string. If I am missing something here, please let me know because it seems odd that for RSS searching this functionality is implemented, but not for regular web searching.

I also wanted to modify my query a bit more, so I tried Gigablast business model.
Result #1 is a collection of entries on a search engine blog. There is an entry about Gigablast, but the term “business model” pertains to another entry about a different engine. Same with #s 2, 3 and 10, which are all duplicates of each other. #8 and #9 are a different set, but same problems. Worth noting is that both sets of duplicates are from Resource Shelf. Why is that? The results are not relevant and they are also duplicates of each other. But then I hit gold with #6, well not exactly gold, but a direct path to gold. It is a SiteLines posting by Rita Vine referring readers to an article by Gwen Harris from July/August, 2004. And here is exactly the information I have been looking for:

At present Wells runs Gigablast without any keyword-activated advertising: there are no banner ads and no sponsored links. This isn’t a matter of principle — it’s just that advertisements slow down the query response time.
Income comes from selling the technology. As Wells explains, he has “built Gigablast to be more efficient than the other engines” to save on time and hardware. Webmasters will be interested in his product line for creating indexes or integrating Gigablast web search results.

So, great I got my answer. But what happened that Gigablast did not take me right to the info I needed? So close, but instead of taking me right to the answer to my question it took me to a site that linked to the answer. The reason in this instance is quite straightforward. Nowhere in Gwen Harris’ article does she use the words “business” or “model”. But fortunately the referring site used the phrase “business model”. I don’t know what Gigablast could do better in a situation like this. If I had continued to modify my query I eventually would have gotten directly to it because I did find the site in Gigablast’s index.

CONCLUSION
I am a fan of Gigablast. I have a lot of respect for what Matt Wells has done. He set out on his own to write a new search engine and he did just that. He scaled the technology so that he can index a billion pages with less hardware and financial investment than the big engines, though of course he still has a ways to go to catch up with the big boys. It seems the crawling and indexing parts of Gigablast are its main focus and strengths. Unfortunately, it is lagging right now in its algorithm and search results.

As with many engines, Gigablast has created XML feeds, but I would rather see them improving their results before adding extras like this. This happens at other engines as well. Even MSN has been building RSS feeds and their relevance needs some work too. (See my earlier MSN review.) I suppose the reason engines are doing this is because it is easier to implement RSS feeds than to improve the elusive beast called relevancy. Plus it may be that different engineers work on each of these, so adding XML feeds in no way impedes the progress of other initiatives like relevancy. Whatever the case may be, things like RSS search feeds are great, I love them and subscribe to several from Technorati and Google, but they are used by the few, not the many, and they are only as useful as the quality of the engine’s results.

What I saw in Gigablast’s results was the following:
• Many duplicate sets
• Many results from one or two sources
• Poor matching for multiple terms. In other words, Gigablast noticed the words in my query on a page, but the words were spread out and not in proximity to each other.
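
To illustrate the proximity point, here is a small Python sketch that measures how tightly the query words cluster on a page. This is my own simplified illustration, not a description of how Gigablast or any other engine scores pages, and the name min_span is made up. A smaller span would presumably deserve a better ranking.

```python
def min_span(page_words, query_terms):
    """Length in words of the smallest window containing every query term.

    Returns None if at least one term never appears on the page.
    """
    terms = {t.lower() for t in query_terms}
    last_seen = {}  # most recent position of each query term
    best = None
    for i, word in enumerate(page_words):
        word = word.lower()
        if word in terms:
            last_seen[word] = i
            if len(last_seen) == len(terms):
                # The tightest window ending here starts at the oldest "last seen" position.
                span = i - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best


query = ["gigablast", "business", "model"]
close = "the gigablast business model is simple".split()
spread = "gigablast is a search engine whose rivals have a business model".split()
print(min_span(close, query), min_span(spread, query))
```

In the second page the phrase "business model" belongs to some other engine entirely, which is exactly the pattern I kept running into in the results above.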

With some improvements to things like this, Gigablast can be a very nice homegrown engine and alternative to the big fellas.

About Chris

I'm Chris and I've worked in the search engine industry since the late '90s.
