September 7th, 2005

Inside the Google Mini Search Appliance


By Alice Hill
RealTechNews

My company has been eying the “mini” ever since it sported a “maxi” price tag. But now with a 30-day free trial and a $3,000 price tag, our IT guy placed an order and the box arrived this week. Lucky for us, the gang over at AnandTech has already disassembled it, taken great shots of every chip and software screen, and written up the whole how-to. Thanks, guys. I will post my own experiences with the Mini when we have ours up and running. Meanwhile, read on:

Configuring your first Collection
“Like most any search product, the first task is to create a collection of what you want searched. The Google Mini supports one collection while its larger brother, the Google Search Appliance, supports an unlimited number of collections. Collections can contain sub-collections (which I’ll explain a bit later).

“Once you’ve created your first collection, the first step is to edit the collection parameters and set it up for indexing. URLs to Crawl was where we started, which contains a few parameters: Starting URLs from which to crawl, Follow and Crawl certain URLs or parts thereof, and Do Not Crawl URLs matching certain patterns. This was probably where we spent 99% of our time configuring the Mini. The Mini allows for 100,000 documents/URLs to be stored in a collection, and AnandTech contains approximately 40,000 articles, news and blog entries.

“When we first set up the Mini, we told it to start in each of the website’s sections (for example, http://www.anandtech.com/it/) and in the web news area. The Mini considers any unique URL string to be a unique document, which makes sense (but is a bit surprising the first time that you run an index).

“After four hours of indexing, the Mini had managed to reach its document limit and we had to improvise. After several attempts at filtering out various URL patterns and restricting the crawling as much as we could, we ended up writing some code. We created a file to which a link to every article, news post and blog post that has been published on the site would be dumped. That file is cached for a few hours as we update the index three times a week. We then configured the Mini to start at those URLs and restricted it only to URLs ending in showdoc.aspx, shownews.aspx and a few others. It worked - the next index was around 38,000 documents. A word to the wise: don’t let the Mini crawl your entire site without keeping a close eye on it.”
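For anyone wondering what that "file of links" trick might look like, here is a rough Python sketch. It is not AnandTech's code (they don't show theirs, and their pages are .aspx), just my guess at the idea under a few assumptions: the function names are made up, the example.com URLs are placeholders, and the allowed list holds only the two page names the article mentions, since the "few others" are never listed. It drops exact duplicate URL strings, because the Mini counts every unique URL string as its own document, and keeps only links whose path ends in an allowed page name.

```python
import html
from urllib.parse import urlsplit

# Only the two page names named in the article; the "few others" are not
# listed there, so they are deliberately left out of this sketch.
ALLOWED_PAGES = ("showdoc.aspx", "shownews.aspx")


def is_allowed(url: str) -> bool:
    """Return True if the URL's path (query string ignored) ends in an allowed page name."""
    path = urlsplit(url).path.lower()
    return path.endswith(ALLOWED_PAGES)


def build_seed_page(urls) -> str:
    """Render a bare HTML page with one link per unique, allowed URL.

    Hypothetical helper: the crawler would be pointed at this page as its
    starting URL, so it only ever sees the links we choose to list.
    """
    seen = set()
    links = []
    for url in urls:
        # Skip exact duplicates and anything outside the allowed pages.
        if url in seen or not is_allowed(url):
            continue
        seen.add(url)
        safe = html.escape(url, quote=True)
        links.append(f'<a href="{safe}">{safe}</a><br>')
    return "<html><body>\n" + "\n".join(links) + "\n</body></html>\n"


if __name__ == "__main__":
    # Placeholder URLs standing in for rows pulled from a content database.
    sample = [
        "http://example.com/showdoc.aspx?i=1",
        "http://example.com/showdoc.aspx?i=1",   # duplicate, dropped
        "http://example.com/shownews.aspx?i=2",
        "http://example.com/search.aspx?q=mini",  # not an allowed page, dropped
    ]
    print(build_seed_page(sample))
```

In a real setup the URL list would come from whatever database backs the site, and the generated page would be cached for a few hours as the quote describes. The same allowed-page list would then be mirrored in the Mini's own crawl patterns so the crawler cannot wander off the seed page.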

Read the Complete In-Depth Look at the Mini Here
(via OhGizmo, one of our contributors!)

