
Custom Development & Integration Case Studies

Internet Service Provider  
 
Full Text Web Search Engine, with Clustered Architecture

Stakeholder

A leading provider of hosted solutions on the Internet

A full-text web search engine with a clustered architecture that supports up to billions of indexed documents, customizable relevancy evaluation, and high-performance index storage.

Business Challenge / Situation

Our customer decided to build a web-scale search engine free of the problems found in existing search systems. The goal of the project was to address issues of both quality and scalability by scaling search engine technology to keep pace with the extraordinary growth of the web.

With the current scale and growth of the World Wide Web, being able to search for and locate web pages effectively and accurately is of prime importance. Currently, the only feasible way for a searcher to locate a particular web-based source is to use a web search engine.

Generic large-scale search engines return thousands of results, many of which are irrelevant to the query, yet searchers tend to look only at the first few. Accurate ranking is therefore critically important.

Creating a search engine that scales even to today's web presents various challenges. Fast crawling technology is needed to gather the web documents and keep them up to date. Storage space must be used efficiently to store indices and the documents themselves.

The system has to keep local copies of documents retrieved from the Internet and needs access to fast data storage. The full-sized document repository, which contains all information about web pages (document headers, archived document bodies, etc.), is estimated at dozens of terabytes.

Solution / our approach

Despite the importance of the large-scale search engines on the web, very little academic research has been done on them. We have investigated the issues of calculating relevance for large volumes of data. As a result, we designed an architecture that can support novel research activities on large-scale web data.

The system is based on the following major concepts:

 

Distributed data processing

Because the Internet is growing so quickly, it is almost impossible to keep an index up to date and perform thousands of query operations per second with one centralized server.

To solve this problem, we have developed distributed data-processing technology.

Each service can run on a dedicated computer as well as share a computer with another service, so that all the components are isolated from each other and can be reused. Due to isolation of the components, this architecture also improves the project maintainability.

Search by meaning feature

Generally, this means the following search scheme:

  • The user enters one or more words to search for.
  • The search engine tries to find an entry for this word in a dictionary and, in addition to the standard search results, generates a list of possible meanings of the word entered.
  • If the user selects a particular meaning, the system generates a refined search request, which consists of the word entered by the user and the synonyms obtained from the dictionary.

This scheme guarantees that, when performing a refined search, the system selects the desired URLs first.
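
The refinement step in the scheme above can be sketched in Java as follows. This is a minimal illustration, not the system's actual code: the class name QueryRefiner, the in-memory map standing in for the dictionary, the sample words, and the OR-joined query syntax are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: class name, data layout, and query syntax are assumptions.
final class QueryRefiner {
    // word -> (meaning label -> synonyms for that meaning)
    private final Map<String, Map<String, List<String>>> dictionary = new LinkedHashMap<>();

    void addMeaning(String word, String meaning, List<String> synonyms) {
        dictionary.computeIfAbsent(word, k -> new LinkedHashMap<>())
                  .put(meaning, new ArrayList<>(synonyms));
    }

    // Possible meanings to offer alongside the standard search results.
    List<String> meaningsOf(String word) {
        Map<String, List<String>> meanings = dictionary.get(word);
        return meanings == null ? new ArrayList<>() : new ArrayList<>(meanings.keySet());
    }

    // Refined request: the original word plus the synonyms of the chosen meaning.
    String refine(String word, String meaning) {
        List<String> terms = new ArrayList<>();
        terms.add(word);
        Map<String, List<String>> meanings = dictionary.get(word);
        if (meanings != null && meanings.containsKey(meaning)) {
            terms.addAll(meanings.get(meaning));
        }
        return String.join(" OR ", terms);
    }

    public static void main(String[] args) {
        QueryRefiner refiner = new QueryRefiner();
        refiner.addMeaning("jaguar", "animal", List.of("panther", "big cat"));
        refiner.addMeaning("jaguar", "car", List.of("automobile"));
        System.out.println(refiner.meaningsOf("jaguar"));
        System.out.println(refiner.refine("jaguar", "animal"));
    }
}
```

A word with no dictionary entry simply falls back to the original query, so the standard search still works when no meaning is available.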

System Highlights

Multi-core server


This feature allows one system server to use several independent indexes. It may be useful for a wide range of applications, such as indexing many independent sites within a single search installation.
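
One way to picture several independent indexes behind one server is a registry that routes each query to a named index. The sketch below is an assumption-laden illustration only: the class MultiIndexServer, the index names, and the use of a plain function as a stand-in for a real index are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: one server process routing queries to independent indexes.
final class MultiIndexServer {
    // index name -> search function for that index (stand-in for a real index)
    private final Map<String, Function<String, List<String>>> indexes = new HashMap<>();

    void register(String indexName, Function<String, List<String>> searcher) {
        indexes.put(indexName, searcher);
    }

    // Each query names the index it targets; indexes never see each other's data.
    List<String> search(String indexName, String query) {
        Function<String, List<String>> searcher = indexes.get(indexName);
        if (searcher == null) {
            throw new IllegalArgumentException("Unknown index: " + indexName);
        }
        return searcher.apply(query);
    }

    public static void main(String[] args) {
        MultiIndexServer server = new MultiIndexServer();
        server.register("site-a", q -> List.of("http://site-a.example/" + q));
        server.register("site-b", q -> List.of("http://site-b.example/" + q));
        System.out.println(server.search("site-a", "news"));
    }
}
```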

Scalability


The search engine can be scaled to any target system, from desktops to high-end computers. This is made possible by the distributed data-processing architecture shown in the screenshot below.

The solution consists of the following components:

  • URL Server
  • Crawler
  • Search Engine
    • Indexer
    • Search Engine Core
  • WEB Front-End
  • Thesaurus
  • Thesaurus Editor

 

URL Server

This handles information about all documents, or rather, the URLs of documents in the system. It manages a simple but very efficient URL database based on hash tables with a high-performance rehashing algorithm. This database is called the URL Repository and is part of the URL Server implementation.
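
The case study does not describe the rehashing algorithm itself. As a rough illustration of the general idea, the Java sketch below stores URLs in an open-addressing hash table with linear probing and rehashes into a table twice the size whenever it becomes half full. The class name, the probing strategy, and the load-factor threshold are assumptions, not details of the actual URL Repository.

```java
// Hypothetical sketch: open-addressing URL set that rehashes into a larger table.
final class UrlRepository {
    private String[] slots = new String[8]; // capacity is always a power of two
    private int size;

    // Returns true if the URL was new, false if it was already stored.
    boolean add(String url) {
        if ((size + 1) * 2 > slots.length) {
            rehash(slots.length * 2); // keep the load factor at or below 0.5
        }
        int i = indexFor(url, slots);
        if (slots[i] != null) {
            return false;
        }
        slots[i] = url;
        size++;
        return true;
    }

    boolean contains(String url) {
        return slots[indexFor(url, slots)] != null;
    }

    int size() {
        return size;
    }

    // Linear probing: returns the slot holding url, or the first empty slot.
    private static int indexFor(String url, String[] table) {
        int mask = table.length - 1;
        int h = url.hashCode();
        int i = (h ^ (h >>> 16)) & mask; // mix high bits into the low bits
        while (table[i] != null && !table[i].equals(url)) {
            i = (i + 1) & mask;
        }
        return i;
    }

    // Rehashing: re-insert every stored URL into a table of the new capacity.
    private void rehash(int newCapacity) {
        String[] bigger = new String[newCapacity];
        for (String url : slots) {
            if (url != null) {
                bigger[indexFor(url, bigger)] = url;
            }
        }
        slots = bigger;
    }

    public static void main(String[] args) {
        UrlRepository repo = new UrlRepository();
        System.out.println(repo.add("http://example.com/"));
        System.out.println(repo.add("http://example.com/"));
        System.out.println(repo.contains("http://example.org/"));
    }
}
```

Keeping the table at most half full guarantees that probing always reaches an empty slot, at the cost of extra memory.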

 

Crawler

The purpose of the Crawler is to retrieve documents from the Internet and pass them to the Indexer. Each Crawler runs 20 threads and keeps up to 20 connections open in each thread at the same time, which is necessary to retrieve web pages at high speed. At peak speed, the system can crawl over 100 web pages per second. A thread-pool technique is used to increase Crawler performance.
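
The thread-pool idea can be sketched with the standard java.util.concurrent classes. This is only an illustration under stated assumptions: the class CrawlerPool and the fetcher parameter are hypothetical, the fetcher is a stand-in for real network code, and the sketch models only the 20 worker threads, not the 20 connections kept open inside each thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical sketch: fetch many URLs on a fixed pool of reusable worker threads.
final class CrawlerPool {
    static final int THREADS = 20; // thread count taken from the description above

    // Fetches all URLs concurrently; results are returned in input order.
    static List<String> crawl(List<String> urls, Function<String, String> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        try {
            List<CompletableFuture<String>> pending = new ArrayList<>();
            for (String url : urls) {
                // Threads are reused across tasks instead of being created per URL.
                pending.add(CompletableFuture.supplyAsync(() -> fetcher.apply(url), pool));
            }
            List<String> documents = new ArrayList<>();
            for (CompletableFuture<String> future : pending) {
                documents.add(future.join()); // waits until that page is fetched
            }
            return documents;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // The fetcher here only fabricates a body; real code would do network I/O.
        List<String> pages = crawl(
                List.of("http://example.com/a", "http://example.com/b"),
                url -> "<html>body of " + url + "</html>");
        pages.forEach(System.out::println);
    }
}
```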

Search engine

The search engine consists of two logical modules: the Indexer and the Search engine core. The Indexer manages the Indexer database, and the Search engine core processes search requests. The Indexer is also responsible for re-indexing new documents retrieved by the Crawler.
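
The split between indexing and query processing can be illustrated with a small inverted index, a standard full-text search structure. The case study does not state which index structure the system uses, so the class TinyIndex, its tokenization, and its match-all-terms query semantics are assumptions made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: an in-memory inverted index with both roles in one class.
final class TinyIndex {
    // term -> URLs of the documents containing that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Indexer role: add a document's terms. Stale terms from an earlier
    // version of the same document are not removed in this sketch.
    void index(String url, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(url);
            }
        }
    }

    // Search core role: return the URLs that contain every query term.
    Set<String> search(String query) {
        Set<String> result = null;
        for (String term : query.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<String> hits = postings.getOrDefault(term, Set.of());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.index("http://example.com/1", "Fast crawling technology");
        index.index("http://example.com/2", "Fast data storage");
        System.out.println(index.search("fast storage"));
    }
}
```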

Indexer database

This is a simple, high-performance database intended to keep close to one million records.

WEB front-end

The front-end is implemented in JSP. This module communicates with the search engine core and the thesaurus. The HTTP server is embedded in the WEB front-end component, so no separate HTTP daemon needs to be started to use the system.

Thesaurus

The purpose of the Thesaurus is to provide meanings and synonyms for a given word and to store relations between words. The front-end application uses this information to provide a search-refining capability, which drastically increases the quality and relevance of search results.

Two dictionaries were implemented: a gateway to WordNet and the system's own Custom Dictionary.

WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of the human lexical memory.

Thesaurus Editor

This provides a web front-end for managing the Thesaurus and editing the Custom Dictionary.

Tools & Techniques & Environment

Java Servlets, XML, JSP, JNI, TCP/IP, WordNet lexical database by Cognitive Science Laboratory, Linux

Value Received

The customer received a system that meets the highest market requirements. The functionality of the designed system is on par with that of the world-leading search engines, as the following figures show:

  • The search engine can index up to 50,000,000 web pages
  • Each Crawler instance processes 20-30 web pages per second
  • Each Indexer instance processes 10-20 web pages per second
  • The search system processes a query in under 1 second on an index of 1 billion documents
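
As a rough cross-check of these figures (simple arithmetic on the stated numbers, not a measured result), the time a single Crawler instance would need to fetch 50,000,000 pages at 20-30 pages per second, and the time at a system-wide rate of 100 pages per second, work out as follows. The \text command assumes the amsmath package.

```latex
\[
t_{\text{single}}
  = \frac{50{,}000{,}000\ \text{pages}}{20\text{--}30\ \text{pages/s}}
  \approx 1.7 \times 10^{6}\text{--}2.5 \times 10^{6}\ \text{s}
  \approx 19\text{--}29\ \text{days}
\]
\[
t_{\text{system}}
  = \frac{50{,}000{,}000\ \text{pages}}{100\ \text{pages/s}}
  = 5 \times 10^{5}\ \text{s}
  \approx 5.8\ \text{days}
\]
```

Under these assumptions, filling an index of that size in days rather than weeks requires several Crawler instances running in parallel.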


 
     

