DEVELOPMENT & INTEGRATION CASE STUDIES
Full Text Web Search Engine, with Clustered Architecture
Stakeholder
A leading provider of hosted solutions on the Internet
A full-text web search engine with a clustered architecture that can handle billions of indexed documents and offers customizable relevancy evaluation and high-performance index storage.
Business Challenge / Situation
Our customer decided to build a Web-scale search engine free of the problems found in existing search systems. The goal of the project was to address issues of both quality and scalability by scaling search engine technology to keep pace with the extraordinary growth of the web.
With the current scale and growth of the World Wide Web, being able to search for and locate Web pages effectively and accurately is of prime importance. Currently, the only feasible way for a searcher to locate a particular Web-based source is to use a Web search engine.
Generic large-scale search engines return results in the thousands, many of which lack relevance to the query. Searchers tend to look at only the first few results anyway, so accurate ranking is critically important.
Creating a search engine that scales even to today's web presents various challenges. Fast crawling technology is needed to gather web documents and keep them up to date. Storage space must be used efficiently to store the indices and the documents themselves.
The system has to keep local copies of documents retrieved from the Internet and needs access to fast data storage. The full size of the document repository, which contains all information about web pages including document headers, archived document bodies, etc., is estimated at dozens of terabytes.
Solution / Our Approach
Despite the importance of large-scale search engines on the web, very little academic research has been done on them. We investigated the issues of calculating relevance for large volumes of data. As a result, we designed an architecture that can support novel research activities on large-scale web data.
The system is based on the following major concepts:
Distributed data processing
Due to the very fast growth of the Internet, it is almost impossible to keep an index up to date and perform thousands of query operations per second with one centralized server.
To solve this problem, we developed a distributed data processing technology.
Each service can run on a dedicated computer or share a computer with another service, so all components are isolated from each other and can be reused. This isolation also improves the maintainability of the project.
Search by meaning feature
Generally, this means the following search scheme:
- The user enters a word to search for.
- The search engine tries to find an entry for this word in a dictionary and, along with the standard search results, generates a list of possible meanings of the word.
- If the user selects a particular meaning, the system generates a refined search request consisting of the word entered by the user and the synonyms obtained from the dictionary.
This scheme guarantees that, when performing a refined search, the system selects the desired URLs first.
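The refinement step in this scheme can be illustrated with a minimal Java sketch. It is not the project's code: the class `MeaningDictionary`, its methods, and the in-memory map are hypothetical stand-ins for the dictionary lookup, and the sample words are invented for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory dictionary: word -> (meaning label -> synonyms).
// It only stands in for whatever dictionary the real system consults.
class MeaningDictionary {
    private final Map<String, Map<String, List<String>>> entries = new LinkedHashMap<>();

    void addMeaning(String word, String meaning, List<String> synonyms) {
        entries.computeIfAbsent(word.toLowerCase(), k -> new LinkedHashMap<>())
               .put(meaning, new ArrayList<>(synonyms));
    }

    // Lists the possible meanings to show next to the standard results.
    List<String> meaningsOf(String word) {
        Map<String, List<String>> meanings = entries.get(word.toLowerCase());
        return meanings == null ? new ArrayList<>() : new ArrayList<>(meanings.keySet());
    }

    // Builds the refined request: the original word plus the synonyms
    // of the meaning the user selected.
    Set<String> refine(String word, String meaning) {
        Set<String> terms = new LinkedHashSet<>();
        terms.add(word.toLowerCase());
        Map<String, List<String>> meanings = entries.get(word.toLowerCase());
        if (meanings != null && meanings.containsKey(meaning)) {
            terms.addAll(meanings.get(meaning));
        }
        return terms;
    }
}
```

With an entry for "jaguar" that has the meanings "animal" and "car", selecting "animal" would yield a refined request containing "jaguar" plus the synonyms stored for that meaning.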
System Highlights
Multi-core server
This feature allows one server to use several independent indexes. It is useful for a wide range of applications, such as indexing many independent sites and searching them from a single search installation.
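As a rough illustration of independent indexes living in one server, the Java sketch below keeps a separate term-to-URL map per named core. The class `MultiCoreServer`, the core names, and the simple whitespace-style tokenization are assumptions made for this example, not details taken from the project.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical multi-core server: each named core owns its own index,
// so documents indexed for one site never appear in another core's results.
class MultiCoreServer {
    // core name -> (term -> sorted set of document URLs)
    private final Map<String, Map<String, Set<String>>> cores = new HashMap<>();

    void index(String core, String url, String text) {
        Map<String, Set<String>> index = cores.computeIfAbsent(core, k -> new HashMap<>());
        // Naive tokenization on non-word characters (an assumption for brevity).
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(url);
            }
        }
    }

    // Looks a single term up in one core only.
    Set<String> search(String core, String term) {
        Map<String, Set<String>> index = cores.get(core);
        if (index == null) {
            return Collections.emptySet();
        }
        return index.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }
}
```

Because every lookup names a core, a query against one site's index cannot return pages indexed for another.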
Scalability
The search engine can be scaled to any target system, from desktops to high-end computers. This is made possible by the distributed data processing architecture shown in the screenshot below.
The solution consists of the following components:
- URL Server
- Crawler
- Search engine
  - Indexer
  - Search engine core
- WEB front-end
- Thesaurus
- Thesaurus Editor
URL Server handles information about all documents or, rather, the URLs of the documents in the system. It manages a simple but very efficient URL database (this component is called the URL Repository, but it is part of the URL Server implementation) based on hash tables with a high-performance rehashing algorithm.
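The source does not describe the URL Repository's internals, so the following Java sketch only shows one common way to build a hash-table URL store. It assumes open addressing with linear probing and interprets "rehashing" as doubling the table and re-inserting every entry; the class `UrlRepository` and its methods are hypothetical.

```java
// Hypothetical URL repository: maps a URL to a numeric document id using
// open addressing with linear probing. "Rehashing" is interpreted here as
// doubling the table and re-inserting every entry (an assumption).
class UrlRepository {
    private String[] keys = new String[16];   // capacity is kept a power of two
    private int[] ids = new int[16];
    private int size = 0;

    // Returns the id for the URL, assigning the next free id if it is new.
    int idFor(String url) {
        if ((size + 1) * 2 > keys.length) {
            rehash();                          // keeps the load factor at or below 0.5
        }
        int slot = findSlot(keys, url);
        if (keys[slot] == null) {
            keys[slot] = url;
            ids[slot] = size++;
        }
        return ids[slot];
    }

    boolean contains(String url) {
        return keys[findSlot(keys, url)] != null;
    }

    int size() { return size; }

    int capacity() { return keys.length; }

    // Linear probing: walk forward from the hashed slot until the URL or a gap is found.
    private static int findSlot(String[] table, String url) {
        int mask = table.length - 1;
        int slot = spread(url.hashCode()) & mask;
        while (table[slot] != null && !table[slot].equals(url)) {
            slot = (slot + 1) & mask;
        }
        return slot;
    }

    // Mixes high bits into low bits so that masking uses more of the hash.
    private static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Doubles the table and re-inserts every stored URL with its existing id.
    private void rehash() {
        String[] newKeys = new String[keys.length * 2];
        int[] newIds = new int[keys.length * 2];
        for (int i = 0; i < keys.length; i++) {
            if (keys[i] != null) {
                int slot = findSlot(newKeys, keys[i]);
                newKeys[slot] = keys[i];
                newIds[slot] = ids[i];
            }
        }
        keys = newKeys;
        ids = newIds;
    }
}
```

Keeping the table at most half full is a deliberate choice in this sketch: it guarantees that every probe sequence reaches an empty slot, at the cost of extra memory.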
The purpose of the Crawler is to retrieve documents from the Internet and pass them to the Indexer. Each Crawler runs 20 threads, and each thread keeps up to 20 connections open at the same time. This is necessary to retrieve web pages at high speed. At peak speed, the system can crawl over 100 web pages per second. A thread-pool technique is used to increase Crawler performance.
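The thread-pool technique mentioned here can be sketched in Java with the standard `java.util.concurrent` executor API. This is a simplified illustration under stated assumptions: the class `CrawlerPool` and the `fetcher` function are hypothetical, the fetch itself is supplied by the caller rather than performed over the network, and the multiple connections per thread described above are not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical crawler skeleton: a fixed pool of worker threads processes
// the URLs, handing each one to a fetch function supplied by the caller.
class CrawlerPool {
    private final int threads;
    private final Function<String, String> fetcher; // URL -> document body (assumed interface)

    CrawlerPool(int threads, Function<String, String> fetcher) {
        this.threads = threads;
        this.fetcher = fetcher;
    }

    // Fetches all URLs concurrently and returns the bodies in input order.
    List<String> crawl(List<String> urls) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> fetcher.apply(url));
            }
            List<String> bodies = new ArrayList<>();
            // invokeAll blocks until every task has finished and keeps task order.
            for (Future<String> result : pool.invokeAll(tasks)) {
                bodies.add(result.get());
            }
            return bodies;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("crawl interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("fetch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A pool created as `new CrawlerPool(20, fetcher)` would mirror the 20 threads mentioned above; reusing a fixed set of threads avoids the cost of creating a new thread for every page.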
Search engine consists of two logical modules: the Indexer and the Search engine core. The Indexer manages the Indexer database, and the Search engine core processes search requests. The Indexer is also responsible for re-indexing new documents retrieved by the Crawler. The Indexer database is a simple, high-performance database intended to keep close to one million records.
WEB front-end for the system is implemented in JSP. This module communicates with two others: the Search engine core and the Thesaurus. An HTTP server is embedded in the WEB front-end component, so no separate HTTP daemon needs to be started to use the system.
The purpose of the Thesaurus is to provide meanings and synonyms for a given word and to store relations between words. The front-end application uses this information to provide a search-refining capability, which drastically increases the quality and relevance of search results.
Two dictionaries were implemented: a gateway to WordNet and the project's own Custom Dictionary.
WordNet is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory.
Thesaurus Editor provides a web front-end for managing the Thesaurus and editing the Custom Dictionary.
Tools & Techniques & Environment
Java Servlets, XML, JSP, JNI, TCP/IP, WordNet lexical database by Cognitive Science Laboratory, Linux
Value Received
The customer received a system that meets the highest market requirements. The functionality of the system is on par with the world's leading search engines, and the following facts show its advantages:
- The search engine can index up to 50,000,000 web pages
- Each Crawler instance processes 20-30 web pages per second
- Each Indexer instance processes 10-20 web pages per second
- The search system processes a query in under 1 second on an index of 1 billion documents