Clusty is a web search tool that clusters results. It is a meta-search engine, so all of its results are pulled from other engines. Clustered results make it easier to quickly get a good understanding of the broad range of information available on a topic. Clustering also gives users the option to browse by clicking on narrower, broader, or related topics. Vivisimo, the company that makes Clusty, has been clustering results for a while, but recently decided to switch its web search traffic to Clusty. Vivisimo.com still offers web search, but is really more of a corporate page that describes the company’s enterprise search solutions.
Back in October I wrote a review of Clusty for the Search Lounge.
This email interview was conducted with Director of Marketing Saman Haqqi.
***************************
Although clustering algorithms have been in the information profession for a while, and even Vivisimo and a couple of other companies have been around for several years, why do you think clustering web results hasn’t really gained much attention with the general public until recently?
The benefits of a clustering engine over simple ranking engines are increasingly recognized by the world at large and are validated by the growing traffic at clusty.com and the adoption of the technology by AOL for its new public search site at www.aol.com. Now one in every eight web searches offers clustered results.
Documents can appear in multiple clusters. How does the clustering engine determine the maximum number of clusters a document should appear in? And, just out of curiosity, do you know the average number of clusters per document? (I’d guess between 2 and 3…)
On average, a document can appear in about 1.7 clusters which seems to be very close to Yahoo’s humanly indexed directory where a document can fall in about 1.6 categories.
I notice that Wikipedia results often show up high in the results. Is there any particular reason for this? Are certain sources weighted higher than others?
Since web searchers typically do not use tabs to focus search results, Clusty dynamically gives higher weight to some sources like Wikipedia, shopping, weather sites, image databases, etc., depending upon the nature of the query. Look for several new key matches to be introduced over the next few months.
I’m curious about cluster depth. The deepest I have seen clusters go is three levels, though generally they seem to max out at two levels deep. Is this something that is pre-set? Or might the depth increase over time as you continue to develop your clustering technology?
The depth of clusters is determined by the number of results in a cluster and the variety amongst them.
How does your company evaluate the relevancy of clusters being returned for searches? Is there a formal process in place for doing this?
We are the most exacting critics of our solution and are always working to improve the quality and relevancy of our clusters.
Vivisimo has a brief, but helpful, white paper called How the Vivisimo Clustering Engine Works. The white paper states that the clustering engine “Does not use a predefined taxonomy or controlled vocabulary…”. Why is that? Wouldn’t it make sense in some cases to overlay the clustering engine’s results on top of subject taxonomies or ontologies? A blended version would allow for scale, while at the same time taking advantage of human categorization.
Clustering overlays well with existing taxonomies and pre-defined categories.
Although not specific to a question I asked, here is some general information about the power of clustering that was included in the response I received.
At Vivisimo we believe that Web searching needs to evolve beyond ‘ranking engines’ that simply list undifferentiated page results ranked by popularity, freshness and links – criteria that don’t do enough to make search results useful to searchers. Clustering lets users view results organized into categories like books organized neatly in bookshelves instead of being randomly piled on the floor. It allows users to quickly overview at least ten times as many results as they would with ranking engines where users rarely go beyond the first page.
Similar results are grouped together so users can not only focus on the result that matches their interest but can also see other similar results easily. A simple ranking engine would have such similar results spread over several pages. Finally, the clustering allows deeper results which would otherwise be buried in later pages to rise to the top.
Key features:
Clusty.com is an implementation of our search engine, clustering engine and content integrator.
The clustering engine processes the text in search result summaries to group the results into folders. This ensures that clusters are returned within milliseconds.
Clusters are ranked based upon an advanced algorithm that accounts for the ranking of results by the underlying search engines, number of results, frequency etc.
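To make the key features above a little more concrete, here is a toy Python sketch of the general idea: group result summaries into overlapping folders, then order the folders using the source ranking and the number of results. This is not Vivisimo's algorithm, which the interview does not describe in detail. The function names, the stopword list, and the choice to label a cluster by a single shared word are all assumptions made purely for illustration.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def cluster_snippets(snippets, query, min_size=2):
    """Group result summaries into overlapping clusters keyed by a shared word.

    `snippets` is a list of summary strings in the order the underlying
    engines ranked them (index 0 is the top result). A document may land
    in more than one cluster.
    """
    query_terms = set(re.findall(r"[a-z]+", query.lower()))
    clusters = defaultdict(list)
    for rank, text in enumerate(snippets):
        terms = set(re.findall(r"[a-z]+", text.lower()))
        # Only the short summary text is processed, never the full page.
        for term in terms - STOPWORDS - query_terms:
            clusters[term].append(rank)
    # Keep only labels shared by at least `min_size` results.
    return {label: ranks for label, ranks in clusters.items()
            if len(ranks) >= min_size}


def rank_clusters(clusters):
    """Order clusters by size first, then by the average source rank."""
    def sort_key(item):
        label, ranks = item
        return (-len(ranks), sum(ranks) / len(ranks), label)
    return [label for label, _ in sorted(clusters.items(), key=sort_key)]


def average_memberships(clusters, total_docs):
    """Average number of clusters each document appears in."""
    return sum(len(ranks) for ranks in clusters.values()) / total_docs


snippets = [
    "Jaguar cars official site",
    "Jaguar the big cat of the Americas",
    "Used Jaguar cars for sale",
    "Jaguar cat habitat and diet",
    "Jaguar cars named after the cat",
]
clusters = cluster_snippets(snippets, "jaguar")
print(rank_clusters(clusters))
print(average_memberships(clusters, len(snippets)))
```

In this toy data the last summary mentions both cars and the cat, so it is filed under two folders, which is the kind of overlap behind the "clusters per document" average discussed earlier. A real engine would use phrases rather than single words and a far richer ranking formula.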