Category Archives: Clustering Engines

Clusty - A Brief Interview About Clustering

Clusty is a web search tool that clusters results. It is a meta-search engine so all results are pulled from other engines. Clustered results help make it easier to quickly get a good understanding of the broad range of information available for certain topics. Clustering also allows users the option to browse by clicking on narrower, broader, or related topics. Vivisimo, the company that makes Clusty, has been clustering results for a while, but recently they decided to switch their web search traffic to Clusty. Vivisimo.com still offers web search, but is really more of a corporate page that describes their enterprise search solutions.

Back in October I wrote a review of Clusty for the Search Lounge.

This email interview was conducted with Director of Marketing Saman Haqqi.

***************************

Although clustering algorithms have been in the information profession for a while, and even Vivisimo and a couple of other companies have been around for several years, why do you think clustering web results hasn’t really gained much attention with the general public until recently?
The benefits of a clustering engine over simple ranking engines are increasingly recognized by the world at large and are validated by the growing traffic at clusty.com and the adoption of the technology by AOL for its new public search site at www.aol.com. Now one in every eight web searches offers clustered results.

Documents can appear in multiple clusters. How does the clustering engine determine the maximum number of clusters a document should appear in? And, just out of curiosity, do you know the average number of clusters per document? (I’d guess between 2 and 3…)
On average, a document can appear in about 1.7 clusters which seems to be very close to Yahoo’s humanly indexed directory where a document can fall in about 1.6 categories.
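To make that 1.7 figure a little more concrete, here is a minimal sketch of how such an average could be computed from a set of overlapping clusters. The function name, cluster labels, and document IDs are all made up for illustration; this is not Vivisimo's code or data.

```python
from collections import Counter

def average_clusters_per_document(clusters):
    """Return the mean number of clusters each document appears in.

    `clusters` maps a cluster label to the document IDs it contains.
    Documents that appear in no cluster are not counted.
    """
    memberships = Counter()
    for doc_ids in clusters.values():
        # set() guards against a document listed twice in one cluster
        for doc_id in set(doc_ids):
            memberships[doc_id] += 1
    if not memberships:
        return 0.0
    return sum(memberships.values()) / len(memberships)

# Hypothetical clusters for an ambiguous query such as "jaguar"
example_clusters = {
    "Cars": ["d1", "d2", "d3"],
    "Animals": ["d3", "d4"],
    "Mac OS": ["d5", "d1"],
}
average = average_clusters_per_document(example_clusters)
```

In this toy data, two of the five documents sit in two clusters each, which is the kind of overlap the answer above describes.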

I notice that Wikipedia results often show up high in the results. Is there any particular reason for this? Are certain sources weighted higher than others?
Since web searchers typically do not use tabs to focus search results, Clusty dynamically gives higher weight to some sources like Wikipedia, shopping, weather sites, image databases, etc., depending upon the nature of the query. Look for several new key matches to be introduced over the next few months.

I’m curious about cluster depth. The deepest I have seen clusters go is three levels, though generally they seem to max out at two levels deep. Is this something that is pre-set? Or might the depth increase over time as you continue to develop your clustering technology?
The depth of clusters is determined by the number of results in a cluster and the variety amongst them.

How does your company evaluate the relevancy of clusters being returned for searches? Is there a formal process in place for doing this?
We are the most exacting critics of our solution and are always working to improve the quality and relevancy of our clusters.

Vivisimo has a brief, but helpful, white paper called How the Vivisimo Clustering Engine Works. The white paper states that the clustering engine “Does not use a predefined taxonomy or controlled vocabulary…”. Why is that? Wouldn’t it make sense in some cases to overlay the clustering engine’s results on top of subject taxonomies or ontologies? A blended version would allow for scale, while at the same time taking advantage of human categorization.
Clustering overlays well with existing taxonomies and pre-defined categories.

Although not specific to a question I asked, here is some general information about the power of clustering that was included in the response I received.
At Vivisimo we believe that Web searching needs to evolve beyond ‘ranking engines’ that simply list undifferentiated page results ranked by popularity, freshness and links – criteria that don’t do enough to make search results useful to searchers. Clustering lets users view results organized into categories like books organized neatly in bookshelves instead of being randomly piled on the floor. It allows users to quickly overview at least ten times as many results as they would with ranking engines where users rarely go beyond the first page.
Similar results are grouped together so users can not only focus on the result that matches their interest but can also see other similar results easily. A simple ranking engine would have such similar results spread over several pages. Finally, the clustering allows deeper results which would otherwise be buried in later pages to rise to the top.

Key features:
Clusty.com is an implementation of our search engine, clustering engine and content integrator.

The clustering engine processes the text in search result summaries to group the results into folders. This ensures that clusters are returned within milliseconds.

Clusters are ranked based upon an advanced algorithm that accounts for the ranking of results by the underlying search engines, number of results, frequency etc.
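
The features above only hint at how the engine works, so here is a toy sketch of the general idea: group results by terms their summaries share, then rank the groups using the underlying engines' result ranks and the group size. Everything here (the function names, the stopword list, the scoring formula) is an assumption for illustration, not Vivisimo's actual algorithm.

```python
from collections import defaultdict
import re

# Tiny illustrative stopword list; a real system would use a much larger one
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "on", "is"}

def cluster_snippets(results, query, min_size=2):
    """Group results by terms shared across their snippets.

    `results` is a list of (rank, snippet) pairs, where rank 1 is the top
    result from the underlying engine. A result can land in several clusters.
    """
    query_terms = set(re.findall(r"[a-z]+", query.lower()))
    clusters = defaultdict(list)
    for rank, snippet in results:
        terms = set(re.findall(r"[a-z]+", snippet.lower()))
        for term in terms - STOPWORDS - query_terms:
            clusters[term].append(rank)
    # Keep only terms shared by at least `min_size` results
    return {t: sorted(r) for t, r in clusters.items() if len(r) >= min_size}

def rank_clusters(clusters):
    """Order cluster labels by a made-up score: more members and better ranks win."""
    def score(ranks):
        return sum(1.0 / r for r in ranks)
    return sorted(clusters, key=lambda t: (-score(clusters[t]), t))

# Hypothetical result summaries for the query "jaguar"
example_results = [
    (1, "Jaguar cars official site"),
    (2, "Jaguar animal facts and habitat"),
    (3, "Used Jaguar cars for sale"),
    (4, "Jaguar animal conservation"),
]
example_clusters = cluster_snippets(example_results, "jaguar")
example_order = rank_clusters(example_clusters)
```

Working only from short summaries rather than full pages is what keeps this kind of grouping cheap, which fits the speed claim above.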

Mooter

Type of Engine: Visual and clustering.
Overall: Good.
If this engine were a drink it would be…an Emu Export: it’s Australian, has a funny name, I’d never heard of it until very recently, and it’s a safe bet that you’ve never heard of it either.

Intro
Mooter is a visual clustering search engine and I like it. They’re from Australia and have been live only about a year.

According to their Technology site, which actually provides some useful information about what they’re up to, “Mooter gets it results from its own spidering, and a unique index of websites. While we are growing, we are supplementing our index with metasearch, and comparing the results from various engines before applying our analysis algorithms.” This is an interesting statement and I’m not exactly sure what they mean by it. If I had to guess it sounds like they’re spidering other engines’ indexes to create their clusters. Is this different from what Clusty or Clush does? I’m not sure, but would love to know the answer. Please email me if you know. It sounds like they plan to generate an entire web index, but that could be wishful thinking.

UI and Features
For the most part I like their interface: it’s simple and almost cheesy, but somehow likable. The Overture-supplied Sponsored Links are killing me though. When you click into a cluster, the Sponsored Links take up nearly half the screen; bad, very bad.

You can click “All Results” to get the full list of results. Mooter maxes out at 120 results, or at least I didn’t find any queries that produced more than that.

If you don’t like the first cluster you see, click on the “Next Clusters” icon (the icon needs some improvement; it looks like a cluster of red pimples) to see another cluster.

Query Example
For phrase searches, each word usually becomes a cluster. For the search “William Styron” one of the clusters was “William.” Not good, but then I clicked on the cluster link and the sites were indeed about William Styron, and not just any old William. But still a “William” cluster doesn’t really help me.

Conclusion
Even if the name of a cluster doesn’t sound relevant, the links contained therein were generally on target. So I’d say they’re getting the back-end organization of clusters correct, but what they need to do is improve their cluster names and concepts. Maybe more phrase matching rather than pulling out just single terms, as if I know what I’m talking about.
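
To show what I mean by phrase matching, here is a rough sketch that names a cluster after the most common word or two-word phrase in its result titles, preferring the longer phrase when counts tie. The function and the sample titles are hypothetical; I have no idea how Mooter actually picks its labels.

```python
from collections import Counter
import re

def best_label(titles, max_words=2):
    """Pick the most common word or phrase, preferring longer phrases on ties."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z']+", title.lower())
        seen = set()
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                seen.add(" ".join(words[i:i + n]))
        counts.update(seen)  # count each phrase once per title
    if not counts:
        return ""
    # Highest count first; on ties prefer more words, then alphabetical order
    return min(counts, key=lambda p: (-counts[p], -len(p.split()), p))

# Hypothetical titles from one cluster
example_titles = [
    "William Styron biography",
    "Novels by William Styron",
    "William Styron, Sophie's Choice",
]
example_label = best_label(example_titles)
```

With these titles, "william", "styron", and "william styron" each appear three times, and the tie-break picks the two-word phrase instead of a bare "William".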

They could also make the visual part of their results more compelling. As it is right now, it almost doesn’t need to be visual because the visual part of it doesn’t add much beyond novelty (and even the novelty is wearing off as more Kartoo-style visual engines appear).

Musicplasma

Overall: Average quality, yet still very enjoyable to play around with.
If this engine were a drink it would be…a mint julep. It’s not your everyday drink, but you’ll find it a sweet break from the norm.

Intro
Musicplasma is a music search tool that lets you discover music artists similar to ones you already like. Oh, and it’s visual, like Kartoo.

I’m not really sure how they determine similarities. If I had to guess I’d say they base it on an ontology of genres (rock, rap, etc.), and on mining something like Amazon’s “Customers who bought that, also bought these” type of functionality.

UI and Features
You can zoom in or out on clusters, thereby focusing or expanding your view of similar artists.

Clicking on the links – those ethereal lines – scrolls the page in that direction. Nice feature!
Clicking on other clusters will refocus the clusters around that artist.
The Design panel allows for changing colors and other appearances if you’re into that kind of thing.

Query Examples
Sometimes the clusters make total sense. Try a search for Guided by Voices and the closest cluster will be Robert Pollard, the lead singer who has done solo albums. Sometimes the clusters are a bit off. Try searching for David Byrne and for some reason Paul Westerberg – lead singer of the Replacements – comes between Byrne and the Talking Heads. I’m not saying that’s incorrect, but my first reaction was surprise. It could be accurate that people who like David Byrne’s solo stuff, which doesn’t sound much like the Talking Heads, might like Paul Westerberg, Warren Zevon and Roxy Music (all closer than the Talking Heads).
I noticed that powerhouses like the Rolling Stones and Neil Young show up in lots of places. I searched for Prince Buster, the 60s ska pioneer, and there’s Neil’s cluster. A search for Bad Brains similarly showed the Stones lurking one link away. Now obviously Neil Young and the Stones have influenced tons of groups, but I’m not sure that Bad Brains should be one link away. Anyone know why that would be?

Conclusion
I’d like to know more about the links. Is one artist linked to another because they collaborated? Or are they linked because they play similar music? Or are on the same label?
OK, so it’s fun to play with, but give me some song samples.
How about letting me type in more than one group so I can really focus in?
Focus by time period. I really like early Stones, when they sounded like, say, the Small Faces, but I hate recent Stones, when they sound like, say, crap.

Musicplasma is fun to play with, but it needs to be more practical. Take the visual music search engine and turn it into an audio search engine. If that’s too far-flung, then at least show more context on how artists are linked. But like I said, it sure is fun…

Clusty

Type of Engine: Clustering.
Overall: Good.
If this engine were a drink it would be…an Anderson Valley Oatmeal Stout. It’s thick and sweet but most people won’t ever get to taste it.

Intro
Clusty is a new search tool made by Vivisimo. It’s a meta-search engine so all results are provided by other search tools. Its distinction is that it clusters results so that you can refine your query by clicking on a more focused topic.

UI and Features
Right now in Web Search there’s a drop-down menu that allows you to cluster as follows:
Source – by engine.
Topic – the heart of Clusty which is clustering by subject.
URL – sort by .com, .org, etc. Also by country code and occasionally by a particular domain though it seems inconsistent as to when this shows up. This is an interesting feature, but I’m not really sure what to do with it just yet. I suppose if I were doing a search for government documents it might be useful to look only at .gov results.
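
For the curious, grouping by URL like this is easy to approximate. Below is a minimal sketch that buckets result URLs by the last label of their hostname; the function name and sample URLs are my own inventions, and Clusty's real rules (for example, when it decides to show a particular domain) are unknown to me.

```python
from collections import defaultdict
from urllib.parse import urlparse

def cluster_by_tld(urls):
    """Group URLs by the last label of their hostname (com, org, gov, uk...)."""
    clusters = defaultdict(list)
    for url in urls:
        host = urlparse(url).hostname or ""
        # Hosts without a dot (or unparseable URLs) go in an "unknown" bucket
        tld = host.rsplit(".", 1)[-1] if "." in host else "unknown"
        clusters[tld].append(url)
    return dict(clusters)

# Hypothetical result URLs
example_urls = [
    "http://www.example.com/tickets",
    "http://www.example.org/",
    "https://example.gov/report",
    "http://news.example.co.uk/story",
]
example_tld_clusters = cluster_by_tld(example_urls)
```

Filtering to `example_tld_clusters["gov"]` would give the government-documents view I mentioned above.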

Other tabs, such as News and Shopping, have different clustering drop-down options. I’ll let you explore each of these on your own.

Not only can you set which tabs are seen, but you can also customize which sources are searched. For example, for News searches you can choose Reuters, BBC, CNN or other news sources. Very nice.

If you click on News, Encyclopedia or Gossip, then Clusty will generate a page with related content on it. This is a helpful feature but the front-page of Clusty should let you know about it.

To see specific recall information for each engine that was searched, click on the Details link above the results.

Icons – in the search results you’ll see the following helpful icons:
New window – opens result in new browser window.
Show in cluster – this highlights which cluster on the left contains the site.
Preview – opens the site within Clusty’s search results page.

Query Examples
The more I played around with Clusty the more I liked it. For example, try searching for tickets to an event. I tried the query Black Rider tickets, as in the Tom Waits play, and thought the clusters were pretty good because it successfully showed me a selection of sites where I could buy tickets to the play.
Clusty currently errs on the side of higher recall for its clusters, so many of the clusters are irrelevant. This can be OK if it leads to discovery, but I think the major area Clusty can improve upon is tightening the relevance of the clusters. For my Black Rider tickets query one of the clusters was Game which meant nothing to me. Another ambiguous cluster was Your tickets ready. Both of these clusters were poorly titled and the results contained within were not very relevant. I know, it’s only one query example and you can’t judge from just one query. But feel safe that I’ve conducted many other queries and most of them have been similar to Black Rider tickets. There have been useful clusters as well as clusters that make you scratch your head.

Conclusion
Clusty clusters on a horizontal plane and that’s what they do well: they surface information which lets you scan the breadth quickly. However, I’d like to see them go a little bit deeper into the categories. I didn’t see any categories below the second level. In other words, you do your search, click on one of the categories, and then can click on one more category and that’s all. I certainly don’t want them to get carried away by overloading the depth of categories, but I think a couple more layers would be helpful in some cases.
I really wish they’d chosen a better name. Why would you ever name a search engine something that rhymes with lusty? But nonetheless, they’re on to something here. I think clustering is a powerful tool that can let searchers discover similar topics or refine their queries. In the future I hope clustering engines will combine with human-created web directories.