Speedy Search With Big Data – Using Amazon Cloud Search And NoSQL Database Combo

  • Post published: June 25, 2015

A search feature with advanced options and deep filters is a tough requirement for a project that holds data records for trillions of users. vteam #455’s client needed exactly that. In this project, the team had to deal with trillions of user and resource records, each associated with file attachments. Relational databases and conventional servers were not efficient enough to give users the quick search responses they needed.

Cloud computing refers to the practice of distributing computational workload or data storage across multiple computing nodes, so that an application's computational overhead is shared among several devices (nodes) in a cloud. We usually prefer a cloud when we need high performance and low latency while executing a large number of operations to get the desired results.

NoSQL, on the other hand, was developed in response to the rise in data volume and in the frequency with which that data is accessed and processed. Relational databases were not designed to handle big data and frequent access, nor were they built to take advantage of cheap storage and processing power.

vteam #455 had a handful of candidate solutions, but each came with significant drawbacks. To fulfill the requirement, we combined cloud computing and a NoSQL database to implement the search feature.

Cloud computing increased computing capacity by dividing the workload among different nodes when many users searched at the same time, and because we had a massive data set, NoSQL was an efficient database structure. We removed the cost of relational joins by using a de-normalized structure that allows object/array data types in a single cell. This made search much simpler. On top of this, NoSQL gave us advanced indexing and caching, which returned responses very quickly.

We chose Amazon Cloud Search Service rather than worrying about hardware and software provisioning, setup, and maintenance. Amazon Cloud Search Service is a cost-effective, fully managed service in the cloud. It makes it possible to search large collections of data with filters.

Challenge:

In this case, the biggest issue was that we could not run the whole application, including the search feature, on NoSQL. At this stage it would have been a huge amount of work to first convert the entire application from a relational database to a NoSQL database and then reconfigure it to work with the new structure.

Solution:

To resolve this problem, we kept the application on the relational database and coded sync cron jobs to upload real-time data to an Amazon Cloud Search domain. The domain served the search feature, while the application kept running on its old relational database (without any change in structure or flow).
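As a rough illustration of what one run of such a sync job might produce, the sketch below turns changed and deleted relational rows into a single JSON batch of add and delete operations. The function name `build_batch`, the `record_` ID prefix, and the row layout are assumptions made for this example and are not taken from the project; the "type"/"id"/"fields" layout follows the JSON document batch format as we understand it, so check it against the service documentation before relying on it.

```python
import json


def build_batch(changed_rows, deleted_ids):
    """Build one JSON document batch from relational changes (hypothetical helper).

    changed_rows: list of dicts, each holding an "id" key plus column values.
    deleted_ids:  list of primary keys removed since the last sync run.
    """
    operations = []
    for row in changed_rows:
        # Skip the primary key and NULL columns; everything else becomes a field.
        fields = {k: v for k, v in row.items() if k != "id" and v is not None}
        operations.append({
            "type": "add",
            "id": f"record_{row['id']}",  # assumed ID scheme
            "fields": fields,
        })
    for record_id in deleted_ids:
        operations.append({"type": "delete", "id": f"record_{record_id}"})
    return json.dumps(operations)
```

A cron job could call `build_batch` with the rows modified since its last run and upload the resulting string, while the application itself keeps reading and writing the relational database.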

The relational database had many sub-tables that managed metadata for every record. We de-normalized them all into one table by using the object/array data types of the NoSQL structure, so that a single row represented the whole record, which reduced the cost of joins.
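The de-normalization step can be pictured with a small sketch. It assumes a hypothetical key/value metadata sub-table whose rows carry `record_id`, `key`, and `value` columns, and that metadata keys do not collide with column names of the main record; the real schema is not described here, so treat the names as placeholders.

```python
from collections import defaultdict


def denormalize(record, meta_rows):
    """Fold key/value metadata rows into one flat document with array fields."""
    grouped = defaultdict(list)
    for meta in meta_rows:
        # Only keep metadata that belongs to this record.
        if meta["record_id"] == record["id"]:
            grouped[meta["key"]].append(meta["value"])
    document = dict(record)   # copy so the caller's row is not mutated
    document.update(grouped)  # each metadata key becomes an array field
    return document
```

With this shape, a search by tag or author reads one document instead of joining the record table against its metadata tables.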

As mentioned above, every record item had files attached, so keyword search inside attachments was also required. For this, we coded a conversion tool that converted all text-based files to JSON, extracted the text from the converted files, and placed it in a cell against the respective record in the Amazon Cloud Search domain.

Another issue arose from Amazon's size limits on uploaded file data. To resolve this, we coded a ranking tool that grabbed the keywords from files and removed all repeated words, so that we could stay within the size limits for upload batch files.
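A minimal sketch of such a keyword tool follows. It assumes that "ranking" means ordering words by how often they occur and that the limit is a byte budget per attachment; both are guesses for illustration, and `extract_keywords` and `max_bytes` are hypothetical names rather than the project's own.

```python
import re
from collections import Counter


def extract_keywords(text, max_bytes):
    """Return unique words, most frequent first, within a UTF-8 byte budget."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(words)
    kept, size = [], 0
    # most_common() sorts by count; equal counts keep first-seen order.
    for word, _count in counts.most_common():
        cost = len(word.encode("utf-8")) + (1 if kept else 0)  # +1 for the space
        if size + cost > max_bytes:
            break
        kept.append(word)
        size += cost
    return " ".join(kept)
```

Because every word is stored once, long attachments shrink considerably, and the frequency ordering means the terms dropped first are the least common ones.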

Milestones Achieved:

The following milestones were achieved while implementing this solution with Amazon Cloud Search Service:

  1. Configured cloud nodes and hardware capacity in Amazon Cloud.
  2. Created a database on Amazon Cloud Search, called a search domain. This is the space that holds the large volume of data.
  3. Coded cron jobs to prepare batch files that upload data from the relational database to the Amazon Cloud Search domain. The cron jobs used the file conversion and keyword ranking tools to grab text from text-rich attachments for every record item, and were also synced with real-time data for updates.
  4. Performed data indexing at the database level.
  5. Created search queries in algebraic notation to retrieve results based on the keywords and search filters chosen by users.
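To give a feel for the last milestone, the sketch below builds a prefix-style query string from a keyword phrase and a set of filters. It is only an approximation of the parenthesized structured query syntax as we understand it; the helper names and the example field names are hypothetical, and the exact syntax and escaping rules should be verified against the service documentation.

```python
def quote(value):
    """Wrap a value in single quotes, escaping backslashes and single quotes."""
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def build_structured_query(keywords, filters):
    """Combine a keyword phrase and field filters into one (and ...) expression."""
    clauses = [quote(keywords)]
    # Sort the filters so the same input always yields the same query string.
    for field, value in sorted(filters.items()):
        clauses.append(f"{field}:{quote(value)}")
    return "(and " + " ".join(clauses) + ")"
```

For example, `build_structured_query("annual report", {"type": "pdf"})` yields `(and 'annual report' type:'pdf')`, which combines the user's keywords with the selected filter in a single expression.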

Results:

A specific search query against the relational database took 20 seconds on average. The same search against NoSQL in the cloud returns data at least 10x faster, in 1 to 2 seconds on average.