Monthly Archives: February 2005


Clusty - A Brief Interview About Clustering

Clusty is a web search tool that clusters results. It is a meta-search engine so all results are pulled from other engines. Clustered results help make it easier to quickly get a good understanding of the broad range of information available for certain topics. Clustering also allows users the option to browse by clicking on narrower, broader, or related topics. Vivisimo, the company that makes Clusty, has been clustering results for a while, but recently they decided to switch their web search traffic to Clusty. Vivisimo.com still offers web search, but is really more of a corporate page that describes their enterprise search solutions.

Back in October I wrote a review of Clusty for the Search Lounge.

This email interview was conducted with Director of Marketing Saman Haqqi.

***************************

Although clustering algorithms have been in the information profession for a while, and even Vivisimo and a couple of other companies have been around for several years, why do you think clustering web results hasn’t really gained much attention with the general public until recently?
The benefits of a clustering engine over simple ranking engines are increasingly recognized by the world at large and are validated by the growing traffic at clusty.com and the adoption of the technology by AOL for its new public search site at www.aol.com. Now one in every eight web searches offers clustered results.

Documents can appear in multiple clusters. How does the clustering engine determine the maximum number of clusters a document should appear in? And, just out of curiosity, do you know the average number of clusters per document? (I’d guess between 2 and 3…)
On average, a document can appear in about 1.7 clusters which seems to be very close to Yahoo’s humanly indexed directory where a document can fall in about 1.6 categories.

I notice that Wikipedia results often show up high in the results. Is there any particular reason for this? Are certain sources weighted higher than others?
Since web searchers typically do not use tabs to focus search results, Clusty dynamically gives higher weight to some sources like wikipedia, shopping, weather sites, image databases etc. depending upon the nature of the query. Look for several new key matches to be introduced over the next few months.

I’m curious about cluster depth. The deepest I have seen clusters go is three levels, though generally they seem to max out at two levels deep. Is this something that is pre-set? Or might the depth increase over time as you continue to develop your clustering technology?
The depth of clusters is determined by the number of results in a cluster and the variety amongst them.

How does your company evaluate the relevancy of clusters being returned for searches? Is there a formal process in place for doing this?
We are the most exacting critics of our solution and are always working to improve the quality and relevancy of our clusters.

Vivisimo has a brief, but helpful, white paper called How the Vivisimo Clustering Engine Works. The white paper states that the clustering engine “Does not use a predefined taxonomy or controlled vocabulary…”. Why is that? Wouldn’t it make sense in some cases to overlay the clustering engine’s results on top of subject taxonomies or ontologies? A blended version would allow for scale, while at the same time taking advantage of human categorization.
Clustering overlays well with existing taxonomies and pre-defined categories.

Although not specific to a question I asked, here is some general information about the power of clustering that was included in the response I received.
At Vivisimo we believe that Web searching needs to evolve beyond ‘ranking engines’ that simply list undifferentiated page results ranked by popularity, freshness and links – criteria that don’t do enough to make search results useful to searchers. Clustering lets users view results organized into categories like books organized neatly in bookshelves instead of being randomly piled on the floor. It allows users to quickly overview at least ten times as many results as they would with ranking engines where users rarely go beyond the first page.
Similar results are grouped together so users can not only focus on the result that matches their interest but can also see other similar results easily. A simple ranking engine would have such similar results spread over several pages. Finally, the clustering allows deeper results which would otherwise be buried in later pages to rise to the top.

Key features:
Clusty.com is an implementation of our search engine, clustering engine and content integrator.

The clustering engine processes the text in search result summaries to group the results into folders. This ensures that clusters are returned within milliseconds.

Clusters are ranked based upon an advanced algorithm that accounts for the ranking of results by the underlying search engines, number of results, frequency etc.
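To make the two points above a little more concrete (folders built from the text of result summaries, then ranked using the underlying engines' ordering and the number of results), here is a minimal Python sketch. It is not Vivisimo's algorithm, which the response does not describe. The stopword list, the idea of labeling a folder by a single shared word, and the scoring order are all assumptions made for illustration.

```python
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not Vivisimo's).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "on"}

def cluster_snippets(snippets, query, min_size=2):
    """Group result summaries into folders keyed by a shared word.

    snippets: summary strings in the order the underlying engine
    ranked them (index 0 is the best-ranked result).
    """
    query_terms = set(query.lower().split())
    folders = defaultdict(list)
    for rank, text in enumerate(snippets):
        words = {w.strip(".,!?").lower() for w in text.split()}
        for word in words - STOPWORDS - query_terms:
            if word:
                folders[word].append(rank)
    # A result can land in several folders, as described in the interview.
    return {label: ranks for label, ranks in folders.items() if len(ranks) >= min_size}

def rank_clusters(folders):
    """Order folders by size, then by best underlying rank (assumed scoring)."""
    return sorted(folders, key=lambda label: (-len(folders[label]), min(folders[label]), label))

snippets = [
    "Saturn is the sixth planet from the Sun",
    "Saturn cars and dealers",
    "Rings of the planet Saturn",
    "Used Saturn cars for sale",
]
print(rank_clusters(cluster_snippets(snippets, "saturn")))  # ['planet', 'cars']
```

Working only from the short summaries, rather than fetching full pages, is what keeps a step like this fast.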

A9 Introduces Visual Yellow Pages

Intro
Of course the day after I published my article about Local Search at the big engines, back on January 26th, I heard about A9’s new undertaking in this area. But I figured I’d give things a few weeks to settle down before getting a Search Lounge review out. Although Amazon’s A9 is not one of the ‘big engines’, since they are not doing full-scale web indexing, they are still a formidable player who I am sure will only get bigger and better as they continue to develop their search products.

A9’s local search, called Yellow Pages, is different from the local services offered by Ask, Google, MSN and Yahoo. The listings data comes from Acxiom, but the differentiating factor is that A9 has collected photographs of businesses. To do this they simply strapped cameras on top of cars and drove around taking photos. Currently, A9 has pictures for ten US cities, but plans to expand that, as evidenced by this job posting for Block View Drivers. For more information, here is an article called A9.com brings Yellow Pages to life by adding 20 million images.

I think there should be a field of query analysis that combs search results on Google and Yahoo looking for the word cool and seeing what results come back for different queries that include the word. I did this with cool search engine for Kartoo and got great results, so let’s try it again. On Yahoo the fourth listing for cool yellow pages is A9. And on Google, the search cool yellow pages shows 3 out of the first 10 listings being about A9, although A9 itself does not appear.

Right now, for San Francisco, there are still many businesses that are missing photographs. But A9 provides the ability for users to submit photos. This is particularly useful if you’re the owner of a photograph-less business.

Also, because A9 covered entire blocks at a time with a series of photos, I have noticed that sometimes the default photo for a business is actually a couple doors away from the best photo of the business. Let me explain: when you get to a business with photos, there will be a row of photos, not just one photo. Each photo is like an animation cel that reproduces the effect of walking along the street, except you see them all next to each other instead of melded together like a cartoon. So there might be two or three photos that have at least some part of the targeted business in them. This can be a good thing because it shows different angles and provides visual context of the street. In any case, correcting this situation is very easy: simply click on whichever photo is the best image. Both times I did it, that image became the default image.

Oh, and the whole thing is free, unlike other solutions like Google’s Keyhole.

Relevancy Tests
The search motorcycle parts in San Francisco had only four results. All four are relevant; they are indeed stores that sell motorcycle parts, though three out of four are for BMWs. (Note: to get to Yellow Page results, click on Yellow Pages on the right hand side of the page. Then click on [full] to expand Yellow Page results across the search result page.) By contrast, Yahoo Local’s motorcycle parts in San Francisco returned 40 results. The small number of results on A9 tells me that the search I entered obviously does not match their classification data because there should be many more listings. It also should be a flag to A9 that I need a little help at this point. Motorcycle parts is not so crazy a query that the engine can’t map it up to something like motorcycles, which returns 66 results because all of the listings are classified in the Motorcycles and Scooters category.

But I will take the initiative and refine my query to motorcycle repair, which has 33 results as well as displaying a matching category and breadcrumb trail: Autos, Boats & Vehicles > Motorcycles & Scooters > Repair & Service. And as readers of the Lounge know, I love my categories. Of the ten results on page one, eight had pictures of the shops, leaving two without photos. Next to the list of results is a map that shows the locations for each place. The search results page is all pretty straightforward with the displayed listings having addresses and phone numbers.

But things change after clicking on a listing’s link. The links take users to an Amazon interface and that is where the photos of the businesses are. There are some other things to see here (in fact maybe too many things, because the page comes with the full Amazon frame), like a link to look at other businesses on the same street, maps and directions, customer reviews, a link to call the business over the Internet, and the ability for business owners to update listing information. But there is one big, gaping absence: there is no web site address listed for the business. There is a field called web site on each business’s update listing page, so hopefully they will get more active incorporating these as well as incorporating external web sites and external reviews.

So, now what about those photographs? They are certainly an attention-getting feature, and they are useful, but right now I am not committed to them being very useful. If you’re kind of bored, cruising around and looking at places can be fun. And if you’re really bored, assigning the best pictures to familiar businesses is like a scavenger hunt.

But there are some useful things about them. In my opinion, A9’s Yellow Pages is not mainly intended for general shopping information; it’s more for finding things like what a particular restaurant looks like. The photos are visual reminders not only of what a place looks like, but also how to get there, where to park, and what the physical space around a location is like. The photos can also be helpful for locating businesses that are in walking distance from each other or from a current location, particularly in a dense city like San Francisco where oftentimes there are many businesses within walking distance of each other. Also, the cameras captured real images. The images are not edited in any way, so they provide a real-life view, for better or worse. A couple of friends and I were checking out a business we used to work for and saw a photo of someone we knew walking into the office to start her day. So in that way, it can be fun to play around with.

Conclusions
The data currently associated with businesses is very straightforward. There are phone numbers and addresses, but no web site URLs. So that is an obvious area for improvement. Since not every business currently has photos, I’m sure that’s another area being worked on. The results I got for searches were all relevant, but the engine could do more to help me with my searches, with things like spell checking and better category matching.

I would also like to be able to enter a street address and then work my way backwards from the map to the businesses on a particular street. If that feature exists, I couldn’t figure out how to get to it. That would be particularly useful if I couldn’t remember the name of someplace I visited, but I remembered the intersection it was near. Right now you can sometimes game the system by searching for a street because there are often businesses that have the street name included in them, but it’s hit or miss.

The real question is, how useful is seeing a photograph of a local business? Obviously it has a coolness factor, but will it bring users back over time?

All in all, here’s my opinion. I think as it stands in its current implementation A9’s Yellow Pages is an intriguing search tool. It is unique because the photos they took are not found elsewhere. It is useful in reminding me what a place looks like on the outside, or so I can send a photo of a business to other people for review. It is also useful for checking out places I’ve never been to, just to get an external sense of what someplace looks like.

But I think that A9 is cooking up something bigger and better for their photos than what we see now. I don’t know what exactly that’ll be, but maybe they’ll figure out a way to hook up series of photographs together to provide panoramic images, such as showing a broader picture of a whole block or area. Or, and this is the big one I’m really hoping for, maybe interior photos will become available so that I can see what it looks like inside and outside a restaurant or bar. In that case these external photos will act as a lure to get local business owners and the general public to upload interior photos. I don’t really expect A9 to strap cameras on people’s heads and go into every business, but neither did I expect them to strap cameras on top of cars and drive around cities.

I don’t know what it’s all leading to, but I’m very curious to follow their path and see where they go with this.

Kartoo

Type of engine: Visual search.

SUMMARY
Relevancy of results: Needs improvement.
Freshness of results: Needs improvement.
Features and functionality: Good.
Quality of help and “about us” pages: Average.
Business model: They sell a variety of search packages that can be reviewed on their solutions page.

INTRO
Remember a few years ago when you were sitting at your desk one slow afternoon and you got an email about a cool new search tool called Kartoo? The sender wrote something like, “You’ve got to check this out. It’s really cool.” And you checked it out and you thought, “Yeah, that is cool.” Remember that day? I certainly do. And then remember what happened next? Every time someone asked you about a cool search tool, you said, “Have you seen Kartoo?”
For years, “cool search engine” has been synonymous with Kartoo. In fact, check out the Yahoo results for cool search engine. What is that in position #1? It’s none other than Kartoo.

But here is the big question: besides calling Kartoo cool, how often do you use it? If you’re like me, the answer is seldom. So I decided to take a closer look at Kartoo and investigate its usefulness beyond the cool factor.

Kartoo launched on April 25, 2002 and since then has gained a reputation as the visual web search engine. On Google, MSN and Yahoo, the search visual search engine returns Kartoo in the first position. Kartoo offers a unique experience because although they are another meta-search engine, their search is based on their “visual display interface”, as they call it. The visual interface uses Macromedia Flash, though they do offer an HTML version as well.

UI AND FEATURES
Kartoo has a legend for their visual display so that you can identify things such as Sponsored Sites, recently updated pages, clusters, certain file types, and domain types (.org, .net, .com). These are all useful designations but it takes some practice to get used to them. There is a lot to look at on a Kartoo results page, or map as it’s called, and so it’s not easy to distinguish some of the subtle visual clues that are available.

Kartoo offers Boolean and other advanced search syntax. Take a look at their Key Tips page for more about this. They state that by adding a question mark at the end of a query, Kartoo interprets the query as a natural language query. But I tried what is the population of Scotland? and then what year did Kartoo launch? and did not find the results useful. The Kartoo interface is not built for natural language searches because you have to mouse-over each result looking to see if it has your answer. And even doing that I could not find the answers I was looking for.

There is a FAQs page that gives more explanations about their technology. One particular question I found interesting is this one:
Is KartOO technology more pertinent than other search engines?
“It often is but not always… In fact, KartOO technology analyses the words you are asking for and then decides to question the most accurate search engines….As to the notion of relevance: when you ask for the word “ray” for example, you may mean the sea animal or the light device. The results you obtain may therefore be accurate or totally irrelevant to what you are looking for.
What is significant about KartOO in such a situation is that this technology provides a map that summarizes all the various and possible topics so that retrieved sites are in fact grouped into a form of topical “family”. A list, i.e., a linear classification of search results, could not represent all the applications connected to a word like “nuclear” for example, and above all, a list could not display the links existing between the applications.”

In other words, Kartoo’s visual interface acts as a clustering engine because it lets users look horizontally across a broad selection of sites quicker than going through a linear list. As their example states, this can be particularly effective for ambiguous queries where the searcher is trying to understand various meanings of the search term. Though it is my opinion that if someone is searching for info about something like the planet Saturn they will type in Saturn the planet rather than the ambiguous Saturn. But in any case I am a big believer in clustering even though I also think engines can expect searchers to help them out a bit.

The search results are a bit slow, but they distract you by showing a neat looking genie who is deep in thought.

One thing that bothered me is that after I clicked on something in the map (that is to say, the search results page), I could not go back one step. I had to reload the original query. It’s really frustrating that I can’t take one step backwards. You can do so after clicking on the Next Map link down in the lower right hand corner, where there is an option to go back to Previous Map, but not if you click on a topic. There is a drop-down list of my recent searches so I can get back that way. However, the interface let me down because my queries were too long, so I can’t tell the difference between “raymond chandler” “dashiell hammett” and “raymond chandler” “dashiell hammett” mystery because the words get cut off.

On the search results page if you mouse over the paper looking icons you’ll see the text summary for the site appear to the left. I should mention a small thing, but something I like. When I click on a site, Kartoo counts the number of times I click on it. That’s helpful with a visual interface so that I can quickly see the paths I have already traveled.

RELEVANCY EVALUATION
I’ve been reading, and really enjoying, some classic hard-boiled mysteries by the masters Raymond Chandler and Dashiell Hammett. I wanted to know what influence Hammett had on Chandler. What did Chandler think of Hammett’s stories?

I searched very generally, just using the men’s names, “raymond chandler” “dashiell hammett”. The first thing I did was to click on the two links that figure most prominently in the middle of the map. The first one is a Wal-Mart page selling a book called “Hard Boiled Mystery Writers: Raymond Chandler, Dashiell Hammett, Ross Macdonald.” OK, that is relevant to my search terms, but I was not looking for a shopping site selling a book about the writers. I was looking for information on the web about the writers. It turns out the second link is also a retail site selling the same book, only it’s from AddALL.com and it doesn’t have the three paragraph summary that the Wal-Mart site has.

Two of the results on the map were classified as articles (in other words, there is a yellow line that connects the listings with the word “articles”, implying that both listings are related to that topic), which sounded promising. One was in French and the other was from High Beam, but to view it I needed to sign up with them.

To summarize the remainder of my experience for this first search, each site I clicked on was either a shopping site selling the book I mentioned earlier or a page in French, with one exception: there was a detailed bibliography of Raymond Chandler that includes this nice quote, “Dashiell Hammett may have shown how mean those streets could be, but Raymond Chandler imagined a man who could go down those streets who was not himself mean.” Not exactly a detailed comparison, but a good quote nonetheless.

Over on the left side of the page is a list of twenty related topics. Normally I would acknowledge that my query isn’t very good and would refine or adjust it accordingly. But in this case I will use Kartoo’s related topics as my refinement.

Here are some of them:
Hardboiled mystery writers, Ross MacDonald, Auteur, Roman, Article, Amazon, Fiction, Hard, Library, Mystery, Writers, Book, Matthew

I clicked on hardboiled mystery writers. Doing so creates a new set of topics, some of which are good, such as detective fiction, and some of which are not so good, like isbn and featured.
I noticed that there are different Amazon country sites showing up, such as .UK and .CA. There are also other shopping sites like Overstock. Along with these shopping sites there are a couple of decent sites such as a Dashiell Hammett bibliography from a fan site that lists four books about the two writers.

I also noticed that even though I limited my search to English pages only, there were still French results.

Not having much luck with my Chandler and Hammett query, it is time to try a whole different user mission. Lately there has been a lot of news here in California about the resignation of the Secretary of State. So I decided it would be a good idea to learn more about just what exactly that position entails.
I searched for California secretary of state job functions. A cursory glance through the results shows a variety of suggested topics that are not quite relevant. There is logistique supply chain, California whitewater, vacation, features, and so forth. None of the related topics offered me anything useful.

Turning to the site results, there is a site about long distance phone rates, another about whitewater rafting, and a travel site. There is also a report written by the former Secretary of State in 2000. As with the related topics, none of these results are helpful to me.

I clicked on the next map and was shown some job listings, a hotel site, a vacation home rental site and so forth. Again, nothing to help me understand the job functions of this position.

I refined my search and entered in California secretary of state responsibilities. Again there are vacation rental listings, a cat breeder site, a computer store, insurance company, long distance phone service, and so forth. Again, nothing close to my user mission.

CONCLUSION
So what is going on here? I am seeing some obvious issues. One is that shopping sites are being boosted high in results. Although I can see why some shopping sites would be returned for my Chandler/Hammett search, these should have been a relatively small percentage of the result set. And for my Secretary of State search there should have been few if any commercial sites. It seems like the word California created a slew of false positives which would explain the vacation rental and whitewater rafting sites.

Another area I see for improvement is the related topics. By way of comparison, I tried California Secretary of State responsibilities on Clusty and got some relevant clusters, such as Kevin Shelley and Office of the California Secretary of State. And in the list of results is this helpful site called State Executive Branch Overview that has a section called What are the Duties of the Secretary of State?

And lastly, the results are not fresh. There were too many results from several years ago appearing in my maps. I realize that Kartoo is a meta-search engine so they are relying on external indexes, but they should still be able to improve upon relevancy, topics and freshness of results from the engines they are pulling from.

So yes, there is no doubt that Kartoo is cool looking, but it really needs to create better topics and to return more relevant sites in order for it to be useful. Right now I consider Kartoo a novelty with great potential more than a really useful search tool. It may be that the Kartoo.com web search is just their way of getting attention to the search solutions that they are selling, but if that is the case I think it is all the more reason to improve upon the web search part of their business.

BrainBoost – Interview with Founder Assaf Rozenblatt

BrainBoost is a natural language search engine. Ask BrainBoost questions in plain English and you’ll get answers in plain English. BrainBoost is automated and uses no human editorial intervention. The legend goes that BrainBoost was created by 24-year-old software programmer Assaf Rozenblatt. It took him a year to build it and he built it so that his fiancée could better do her college research.

For more information and analysis, check out the review of BrainBoost I did for the Search Lounge.

***
Hi Assaf, thank you so much for joining me here at the Search Lounge. I know you started BrainBoost, but what exactly is your role these days? And can you provide some more background about the size and structure of the company?

We are a very small team at the moment, with only a handful of developers.
We are still primarily focused on development, but we will be switching gears soon to the sales and marketing of our licensable AnswerRank technology.
I am still very hands-on with the software development and continually help improve the technology on an ongoing basis.

A big issue in Internet search is evaluating the trustworthiness of sources. This issue is amplified in BrainBoost because the answers are shown right on the search results page and do not require users to click through to investigate the trustworthiness of the source. For example, for the search what is the population of Scotland?, the first three answers are slightly different (5.2 million, 5.1 million, just over 5 million. Like I said, just a slight difference.) Maybe if you included a published/crawled date, would that help? Or some kind of page rank metric? Do you have any suggestions for how BrainBoost users should address this issue?

We are currently working on a PageRank-like system to help identify trustworthy sources.

How do you evaluate the relevancy and quality of the results that are returned on BrainBoost? Do you have a formal process in place for doing this? And, what subjects or types of queries do you think BrainBoost is particularly good at? How about subjects or types of queries that need some improvement?

For QA, we compiled a database of common questions and manually researched the answers for each of them. We then run the questions through the BrainBoost engine, which in turn automatically goes out to find answers. Precision is then easily determined by comparing what percent of the automatically generated BrainBoost answers match our manually found answers.
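The QA measurement described here boils down to a simple ratio: matching answers divided by answers produced. A minimal Python sketch follows. The function names and the exact-match-after-normalization rule are assumptions, since the interview does not say how a "match" is judged.

```python
def normalize(answer):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(answer.lower().split())

def precision(engine_answers, reference_answers):
    """Fraction of engine answers that match the manually researched ones.

    Both arguments map question text -> answer text. Exact match after
    normalization is an assumption made for this sketch.
    """
    if not engine_answers:
        return 0.0
    hits = sum(
        1
        for question, answer in engine_answers.items()
        if question in reference_answers
        and normalize(answer) == normalize(reference_answers[question])
    )
    return hits / len(engine_answers)
```

A real evaluation would likely need a looser notion of matching (for example, treating "5.1 million" and "just over 5 million" as agreeing), which is exactly where the judgment calls come in.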

There really isn’t a question type that is problematic for us at this time.

BrainBoost is 100% automated, but would you consider blending BrainBoost’s technology with some editorial content or mapping of results for certain types of queries?

Extracting answers from unstructured documents is what really sets us apart from existing ‘Answer Engines’ like Ask Jeeves and the new MSN search. It’s a much trickier problem to solve, and we are going to continue focusing on it for the time being.

Can you provide any insight into how BrainBoost reformulates a query when it sends it to another engine? Any chance you might be willing to provide an example of how this works?

Query reformulation helps ensure search engines return web pages that most likely contain answers somewhere within them. A simple example: “what does NASA stand for” gets reformulated into “NASA stands for”. This simple reordering of words (and the conjugation of the verb) greatly boosts the likelihood that relevant documents are returned by the engines. With larger and especially multipart questions this can get very complicated.
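The NASA example can be expressed as a single rewrite rule. The Python sketch below implements only that one rule with a regular expression. The rule table and function name are hypothetical, and BrainBoost's actual reformulation logic is not described beyond this example.

```python
import re

# One hand-written rule mirroring the interview's example.
# The (pattern, template) table itself is a hypothetical structure.
RULES = [
    (
        re.compile(r"^what does (?P<subject>.+?) stand for\??$", re.IGNORECASE),
        "{subject} stands for",
    ),
]

def reformulate(question):
    """Rewrite a question as the start of a declarative answer phrase.

    Falls back to the original text when no rule applies.
    """
    cleaned = question.strip()
    for pattern, template in RULES:
        match = pattern.match(cleaned)
        if match:
            return template.format(subject=match.group("subject"))
    return cleaned

print(reformulate("what does NASA stand for"))  # NASA stands for
```

Sending the phrase "NASA stands for" to an engine favors pages that contain the answer sentence itself, which is the point of the reordering.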

There’s something I don’t quite understand about BrainBoost. I enter a search on BB; BB reformulates my query and sends the new query against other engines; the other engines provide results; BB gathers those results and ranks them. OK, so here’s the question: how does BB take a result from another engine and then show a different description (and title?) than what I would see on the other engine? Or am I missing a piece of the puzzle?

BrainBoost does not just display the results it gathers from other engines. It merely uses those results as its starting point. The core technology of BrainBoost is a system we call AnswerRank. The AnswerRank system is given a question and a collection of documents. AnswerRank then analyzes the documents line by line and automatically extracts the very best answers from those documents. The top few hundred search results from the popular engines are what we feed into AnswerRank. BrainBoost begins where the search engines leave off.
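The shape of that pipeline (a question plus a pile of documents in, a handful of the best lines out) can be sketched in a few lines of Python. How AnswerRank actually scores a line is not disclosed, so the word-overlap score below is purely a stand-in assumption, as is the function name.

```python
def best_lines(question, documents, top_n=3):
    """Return the lines that share the most words with the question.

    Word overlap is a placeholder scoring rule, not AnswerRank's method.
    """
    terms = {w.strip("?.,").lower() for w in question.split()}
    scored = []
    for doc in documents:
        for line in doc.splitlines():
            words = {w.strip("?.,").lower() for w in line.split()}
            score = len(terms & words)
            if score:
                scored.append((score, line.strip()))
    # Stable sort: lines with equal scores keep their original order.
    scored.sort(key=lambda pair: -pair[0])
    return [line for _, line in scored[:top_n]]
```

This also shows why the descriptions on BrainBoost differ from those on the source engines: the text displayed is a line pulled from inside the page, not the engine's own summary.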

Does BrainBoost give a higher weight to certain sources? How about results from certain engines?

No, not at this time. All sources begin processing with an equal weight.

I’ve noticed that it matters if I don’t format my search like a question. Compare these two queries: population of Scotland vs. what is the population of Scotland?. Is that done on purpose?

BrainBoost pays close attention to all words in the question. The type of words you use and the order in which you use them determines what classification, or algorithm, BrainBoost will use to answer your question. Whereas most search engines ignore words like ‘what’, ‘where’, ‘when’ and ‘how’, BrainBoost very much relies on them. In this case, the wording of the two questions resulted in two distinct classifications.

Sometimes I see repeat phrases being displayed, such as for the query What is BrainBoost, the following phrase is repeated several times:
-BrainBoost is a Question Answering search engine.-
This probably is not too big a deal, and in fact it may even be a good thing because it shows agreement, but what is your opinion about it?

We chose not to filter out answers that provide the same information in slightly different ways. Like you said, it really does help with identifying agreement towards a specific answer.

I read some helpful information you posted about BrainBoost in a thread on Search Guild. You wrote: “BrainBoost classifies incoming questions into distinct categories. Classification enables BrainBoost to predict what lexical properties the answer will most likely contain.” Can you expound on this? Do you classify searches based on the subject or topic of the search? Or do you parse the query to look for clues in the phrasing of the search? Or…?

It’s best to give an example: When asked “how long do cats live?” BrainBoost recognizes that the user is looking for sentences that quantify the answer in terms of years/months/weeks etc. Responding with an answer that talks about inches/feet/centimeters would not be very intelligent at all. BrainBoost has many dozens of these types of classifications, all of which help ensure suitable answers are returned.
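The cats example suggests a two-step process: classify the question by how it opens, then keep only candidate sentences containing the kind of unit that class predicts. Here is a minimal Python sketch of that idea. The classification table, its two entries, and the function names are hypothetical, and the sketch only handles the duration sense of "how long" used in the example.

```python
import re

# Hypothetical classification table: question opening -> unit words
# that a suitable answer sentence is expected to contain.
EXPECTED_UNITS = {
    "how long": {"year", "years", "month", "months", "week", "weeks", "day", "days"},
    "how tall": {"inch", "inches", "foot", "feet", "centimeter", "centimeters"},
}

def classify(question):
    """Return the matching question opening, or None if unclassified."""
    lowered = question.lower().strip()
    for opening in EXPECTED_UNITS:
        if lowered.startswith(opening):
            return opening
    return None

def suitable(question, sentence):
    """True when the sentence contains a unit expected for the question's class."""
    opening = classify(question)
    if opening is None:
        return True  # no prediction available, so nothing to filter on
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return bool(words & EXPECTED_UNITS[opening])
```

With this table, a sentence mentioning years passes for "how long do cats live?" while one mentioning only inches is rejected. It also illustrates why dropping the question words changes the results: without "how long" there is no class to predict from.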

It seems like I hear very little about BrainBoost. Are you purposefully trying to keep a low profile? Or might that change in the future? I like BrainBoost and since it is so easy to use I think a lot of other people would like it too.

Yes, we have been trying to keep a low profile. It’s given us the luxury of time we needed to perfect our AnswerRank system.

A considerable amount of time was also spent on packaging AnswerRank technology into a licensable software component that can be ‘plugged into’ any existing keyword-based search system, allowing for companies to add Question Answering to their existing in-house search.

What do you see as the current state of natural search engines on the web? Would you care to predict for us what the world of natural search will look like a couple years from now?

I think Natural Language question answering mixed with sophisticated personalization is the future of search.

Lastly, what is your favorite drink?

Triple Grande Latte

Assaf, thank you for your time. Is there anything else you would like to add?

Thanks for your time Chris.