NoSQL Database Cluster Configuration for High Performance


This post focuses on current commodity hardware: its optimal configuration for various types of use cases, and how your choice of NoSQL database can be configured to take full advantage of it for high performance.

Hardware Performance

Disk Drive Performance

For disk drives, the most time-consuming operation is the seek. According to Wikipedia, the average seek time of a 7200 RPM drive is about 4 ms, so in the worst case (purely random access) a drive can perform roughly 250 block reads per second. A primary-key B-tree index lookup needs at least 4-5 disk seeks (a table with over 100K records will have a B-tree 4-5 levels deep), so realistically a single drive delivers about 50 record lookups per second.

For sequential reads and writes (append-only databases or Kafka-like use cases), a 7200 RPM disk drive can sustain about 100 MB per second; at a 1 KB record size that is roughly 100,000 records read or written per second.
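To make these back-of-envelope numbers concrete, here is a small sketch using the assumptions above (4 ms per seek, 5 seeks per B-tree lookup, 100 MB/s sequential throughput, 1 KB records):

```java
// Back-of-envelope throughput estimate for a single 7200 RPM drive,
// using the assumptions from the text above.
public class DiskThroughputEstimate {
    public static void main(String[] args) {
        double seekTimeMs = 4.0;            // average seek time (assumed above)
        int seeksPerLookup = 5;             // B-tree levels touched per primary-key lookup
        double seeksPerSecond = 1000.0 / seekTimeMs;                      // ~250 random reads/s
        double randomLookupsPerSecond = seeksPerSecond / seeksPerLookup;  // ~50 lookups/s

        double sequentialBytesPerSecond = 100.0 * 1024 * 1024;  // ~100 MB/s sequential
        int recordSizeBytes = 1024;                             // 1 KB records
        double sequentialRecordsPerSecond = sequentialBytesPerSecond / recordSizeBytes; // ~100,000/s

        System.out.printf("Random B-tree lookups/s: %.0f%n", randomLookupsPerSecond);
        System.out.printf("Sequential records/s:    %.0f%n", sequentialRecordsPerSecond);
    }
}
```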

SSD Performance

High-performance SSDs today can do about 100,000 IOPS at roughly 4 KB per operation, which puts their random-access performance in the same range (in records per second) as a disk drive's sequential performance.

It may therefore be a good idea to use disk drives for append-only files such as redo logs, or for append-style workloads like Kafka.

Object Serialization/deserialization Performance

Based on performance analysis published on this blog, ProtoBuf can serialize/deserialize about 250,000 1,000-byte payloads per second on an Intel Core i3-3240 (Ivy Bridge) 3.40 GHz processor. This should scale roughly linearly with more CPUs and cores, since the operation is mostly CPU bound.
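A rough way to measure this yourself is a simple timing loop like the sketch below. It uses built-in Java serialization as a stand-in for ProtoBuf (so the absolute numbers will differ), and a real benchmark should use a harness such as JMH:

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Rough single-threaded serialization throughput check. Built-in Java
// serialization stands in for ProtoBuf here; the measurement approach,
// not the absolute number, is the point.
public class SerializationThroughput {
    static class Payload implements Serializable {
        byte[] data = new byte[1000];     // ~1,000 byte payload, as in the text
    }

    public static void main(String[] args) throws Exception {
        Payload payload = new Payload();
        int iterations = 200_000;

        // Warm-up so the JIT compiles the hot path before timing.
        for (int i = 0; i < 20_000; i++) serialize(payload);

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) serialize(payload);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%.0f serializations/sec%n", iterations / seconds);
    }

    static byte[] serialize(Payload p) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(p);
        }
        return bos.toByteArray();
    }
}
```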

Typical hardware configuration

For cost and performance, we will use the typical Hadoop-recommended configuration as our baseline.

Number of CPUs/Cores

Based on price, power consumption, use case and rack size, try to get as many CPUs as possible per machine.

It's always better to scale up to a certain limit before thinking about scaling out. Machines with 8, 16 or even 32 cores are possible in today's commodity hardware configurations.

To put this in perspective, typical in-memory hash map implementations like Redis can do ~100,000 get/set operations per second per CPU. So most NoSQL workloads are going to be memory and disk bound (not CPU bound), since the throughput of a disk-based NoSQL database is never going to reach 100,000 queries per second per CPU.
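As a quick illustration (not a rigorous benchmark), a plain single-threaded HashMap easily reaches that order of magnitude:

```java
import java.util.HashMap;
import java.util.Map;

// Quick sanity check of in-memory get/set throughput on a single core.
// A plain HashMap is used as a stand-in for a Redis-style key/value store.
public class InMemoryGetSet {
    public static void main(String[] args) {
        Map<Integer, String> map = new HashMap<>();
        int ops = 1_000_000;

        long start = System.nanoTime();
        for (int i = 0; i < ops; i++) {
            int key = i % 100_000;
            map.put(key, "value-" + key);  // set
            map.get(key);                  // get
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%.0f get/set pairs per second%n", ops / seconds);
    }
}
```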

Physical Memory

Try to get as much physical memory as possible within the constraints of price, size and power consumption. Configurations of 72 GB, 128 GB, 256 GB or 512 GB are all possible; the more the better.

Disk Drives and SSDs

SSDs beat disk drives in every respect except capacity. A commodity server can hold 14 disk drives of 2-4 TB each, which amounts to roughly 28 TB to 56 TB of disk storage per node.

Cost and physical slot constraints keep SSDs from reaching that kind of capacity per machine. Typical SSD storage ranges from 1 TB up to about 10 TB.

Choosing hard drives or SSDs

From a pure capacity perspective, hard drives make sense: you would need to fill up tens of terabytes before having to consider sharding. But performance is a big bottleneck with hard drives; with 14 drives, you can expect roughly 14 * 50 ≈ 700 record lookups per second.

SSDs can do about 100,000 IOPS, or about 20,000 record lookups per second (one fifth, because of the B-tree lookup). With 2-10 SSDs per machine, one can expect up to about 20,000 * 10 = 200,000 record QPS.
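Putting the per-device numbers together at the node level (device counts and per-device rates are the assumptions from above):

```java
// Node-level random-lookup capacity estimate, using the per-device
// assumptions from the text above.
public class NodeQpsEstimate {
    public static void main(String[] args) {
        // Hard drives: ~50 B-tree record lookups/s each, 14 drives per node.
        int hddLookupsPerSecond = 50;
        int hddCount = 14;
        System.out.println("HDD node QPS: " + hddLookupsPerSecond * hddCount);        // ~700

        // SSDs: ~100,000 IOPS each, ~5 I/Os per B-tree lookup, up to 10 per node.
        int ssdIops = 100_000;
        int iosPerLookup = 5;
        int ssdCount = 10;
        System.out.println("SSD node QPS: " + (ssdIops / iosPerLookup) * ssdCount);   // ~200,000
    }
}
```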

You can also choose a hybrid approach: based on the use case, some collections could live on SSDs and some on hard drives. For example, if you are building an e-commerce site, the workload usually behaves such that 80-90% of queries hit hot records already in memory; there will be many queries for an iPhone or other popular items, or for user preferences, all of which can be served from memory. In that case it may be fine to use hard drives. For append-only use cases such as redo logs and Kafka, use disk drives.

RAID

For random access, use RAID 0 for performance. Most NoSQL databases provide their own redundancy, so redundant RAID configurations may not be required. But if you are using disk drives and your data size will stay below the available capacity, you may consider a redundant RAID configuration; at least then a single disk failure will not trigger node rebalancing.

For append-only use cases such as redo logs and Kafka, don't use RAID, as these workloads rely heavily on the sequential access speed of the disk.

Payloads

OLTP

OLTP use cases mostly operate on a single record at a time. From a scalability, extensibility and maintenance point of view it is easier to keep schemas properly normalized. De-normalized schemas create maintenance issues, because future use cases may be hard to implement if the schema is not properly normalized. The classic example is a teacher changing her class timings: with a normalized schema, only one record in one table needs to change, but with a de-normalized schema the records of all students taking that class might have to be updated.
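A minimal sketch of the two models for the teacher/class example (the class names here are hypothetical, just to show where the update lands in each model):

```java
import java.util.List;

// Normalized model: the class time lives in exactly one place, so changing
// it touches a single record.
class ClassSection {
    String classId;
    String teacherId;
    String startTime;       // update here and every student sees the new time
}

class Enrollment {
    String studentId;
    String classId;         // students only reference the class
}

// De-normalized model: each student document embeds the class time, so a
// schedule change means rewriting every enrolled student's record.
class StudentDocument {
    String studentId;
    List<EmbeddedClass> classes;

    static class EmbeddedClass {
        String classId;
        String teacherId;
        String startTime;   // duplicated per student; must be updated everywhere
    }
}
```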

For OLTP workloads it is better to choose a NoSQL database that supports relational-style schemas and joins.

A new wave of NewSQL databases is picking up here, focused on the mantra of “the functionality of SQL/relational databases with the scale of NoSQL”.

Data warehousing / Analytics

The use cases here are mostly inserting records into fact tables and joining them with dimension tables. The key point is that dimension tables change very rarely; they are sometimes called slowly changing dimensions. For example: Walmart adding support for a new product, or adding a new store.

Due to their insert-only nature, fact tables can grow without bound, so proper sharding of the data is an absolute must. Also, because of the append/insert-only nature, it is acceptable to de-normalize the schema to some extent. Schema de-normalization does not mean joins are no longer required: in these workloads you will still need join support for joining with dimension tables. For example, to count the number of products sold in Walmart store A from the Cereal section on the bottom rack, three dimensions are used: the Store dimension, the Section dimension and the Rack dimension. The Rack and Section dimensions may be hierarchical and may not be easy to flatten into metadata to avoid joins. In this case it is best to use joins (and we will be better off using a database that supports joins) rather than a name/value or document database that does not support them.
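To illustrate the kind of fact-to-dimension join involved (the store/section/rack data here is made up), a tiny in-memory version of that count query could look like this:

```java
import java.util.List;
import java.util.Map;

// Counting products sold in store A, section "Cereal", bottom rack --
// a tiny in-memory stand-in for a fact-to-dimension join.
public class StarSchemaJoin {
    record Sale(String storeId, String sectionId, String rackId, int quantity) {}

    public static void main(String[] args) {
        // Slowly changing dimension tables, keyed by id.
        Map<String, String> stores   = Map.of("S1", "Store A", "S2", "Store B");
        Map<String, String> sections = Map.of("SEC1", "Cereal", "SEC2", "Dairy");
        Map<String, String> racks    = Map.of("R1", "Bottom", "R2", "Top");

        // Append-only fact table.
        List<Sale> sales = List.of(
                new Sale("S1", "SEC1", "R1", 3),
                new Sale("S1", "SEC1", "R2", 5),
                new Sale("S2", "SEC1", "R1", 7));

        int total = sales.stream()
                .filter(s -> "Store A".equals(stores.get(s.storeId())))
                .filter(s -> "Cereal".equals(sections.get(s.sectionId())))
                .filter(s -> "Bottom".equals(racks.get(s.rackId())))
                .mapToInt(Sale::quantity)
                .sum();

        System.out.println("Units sold: " + total);   // 3
    }
}
```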

The bottom line is that data warehousing payloads need a de-normalized schema on a shardable, horizontally scalable database, but it may not be a bad idea to select one of the newer NewSQL databases that are highly scalable and still support joins and other relational constructs. Google's Datastore, for example, is highly scalable and also supports some basic SQL-like queries and joins.

Real time analytics

In this case, data from the data warehouse is fed back into OLTP databases (or other databases that can be queried in real time) so it can be used to serve users better; for example, sites providing recommendations based on the user's current activity.

This can be a huge amount of mostly read-only data (it might change once a day, via inserts, updates or deletes) fed from the data warehouse and queried by the real-time use case. For example, say a user is watching some movie and recommendations are generated based on that. The recommendation engine uses both real-time OLTP data (to get the user's current activity) and data-warehouse data in its algorithms to come up with the best possible recommendations for that user.

NoSQL Databases

Due to the scale and size of the data, any NoSQL database you select should support sharding and replication.

Most of the current wave of popular NoSQL databases are document or name/value stores, and are not fully designed to take advantage of current commodity hardware specs.

MongoDB

Use RAID 0 for collection and index files.

MongoDB's architecture almost forces the use of SSDs, which limits the maximum storage capacity per node. That in turn forces sharding even when other resources like CPU or memory have not reached their peak capacity.

Due to its memory-mapped files and table-level locking for writes, it cannot use the commodity hardware stack to its full capacity. Sharding then causes hardware, operational and maintenance costs to skyrocket, along with a more fragile cluster structure, since even sharding cannot completely solve the underlying architectural issues.

On top of that, the loose (document) schema model gets in the way of future enhancements: the need for atomicity forces schema de-normalization, which increases document size, eventually hurts performance, and brings the other issues of a de-normalized schema.

Bad queries can further degrade cluster performance, since infrequently used records can clog up physical memory because of the memory-mapped files.

Cassandra

Cassandra reads and writes can be sequential (the architecture documentation is not clear on whether reads are sequential). SSTables are written to disk sequentially, and they are periodically read back sequentially to merge them into one SSTable (compaction).

Even though reads and writes are sequential, over time the growing file count can lead to disk fragmentation. The drives then lose the speed advantage of sequential access, which shows up in performance and throughput.

Based on this information, we can assume a Cassandra node need not use RAID and should not need SSDs.

Couchbase

Couchbase is a document database like MongoDB, so the de-normalization issues discussed for MongoDB also apply to Couchbase.

Couchbase is a classic example of an append-only database, where every insert/update/delete is appended to a data file, and compaction keeps compacting the data files for every bucket.

If the data size is bigger than physical memory, random-access reads may be required to fetch a record.

Since disk drives and SSDs perform about the same for sequential access, writes can use either. For reads, SSDs will outperform hard drives because of the random-access nature of the reads.

So, based on the use case and data size, you may have to choose between SSDs and disk drives; the rules of thumb below (and the small helper sketch after them) summarize the choice.

  • If physical memory per node is about the same as the data size, use disk drives; I would say data up to about 50% larger than memory is still fine for disk drives.
  • If your use case is such that 80-90% of queries are served by hot records from memory, use disk drives.
  • In any other case, use SSDs, but note that you may have to start sharding earlier than you wanted.
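Those rules of thumb can be captured in a small helper sketch (the class, method and threshold values here are illustrative, not part of Couchbase):

```java
// Rule-of-thumb storage choice for a Couchbase-style append-only node,
// encoding the guidelines listed above.
public class StorageChoice {
    enum Storage { DISK_DRIVE, SSD }

    static Storage choose(double dataSizeGb, double physicalMemoryGb, double hotRecordHitRatio) {
        // Data fits in (or is at most ~50% larger than) physical memory: disk is fine.
        if (dataSizeGb <= physicalMemoryGb * 1.5) return Storage.DISK_DRIVE;
        // 80-90%+ of queries served by hot records in memory: disk is fine.
        if (hotRecordHitRatio >= 0.8) return Storage.DISK_DRIVE;
        // Otherwise reads hit storage randomly: prefer SSD (and plan for earlier sharding).
        return Storage.SSD;
    }

    public static void main(String[] args) {
        System.out.println(choose(300, 256, 0.5));   // DISK_DRIVE (data roughly fits in memory)
        System.out.println(choose(2000, 256, 0.85)); // DISK_DRIVE (hot working set)
        System.out.println(choose(2000, 256, 0.4));  // SSD
    }
}
```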

Kafka

Even though Kafka is not a NoSQL database, it shows up in most NoSQL cluster deployments.

Its workload is mostly append-only and disk-drive friendly, with modest CPU needs. It usually benefits from as many disk drives as possible, with no need for RAID.

What is better? Unsafe, ByteBuffer or Direct ByteBuffer?

Summary

Unsafe memory allocation is about 10-12% faster than ByteBuffer and Direct ByteBuffer. Since Unsafe memory lives outside the JVM heap, the application needs to manage its allocation and deallocation itself.

There are also a couple of Java runtime options to consider that might help boost ByteBuffer performance.

The conclusion is: consider all three options listed below and see which one is better for your specific use case. We are working on open-sourcing a wrapper library that wraps an Unsafe buffer in a Java ByteBuffer so the two can be used interchangeably in code.

In WonderDB, we are going to support all three options. Since most of our cache memory is allocated at startup and destroyed at shutdown, we have no real issue with the application having to manage memory itself in the Unsafe case.

Our Use Case

WonderDB is a NoSQL transactional database built on relational database architecture and design concepts. Our caches are fully disk-backed, and we manage the life cycle of data buffers by paging them in and out of disk files. We have currently built our caching engine on top of direct ByteBuffers and are now evaluating other approaches for performance improvements.

Different ways to store bytes in memory

We are considering the following options:

  • ByteBuffer
  • Direct ByteBuffer
  • Unsafe

Most access to the cache in WonderDB consists of reading and writing serialized byte arrays.
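As a minimal sketch (not WonderDB's actual cache code), here is what writing and reading back a serialized byte array looks like with each of the three options; the Unsafe instance is obtained via the usual reflection hack:

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import sun.misc.Unsafe;

// The three buffer options compared in this post, each writing and
// reading back a serialized byte array.
public class BufferOptions {
    public static void main(String[] args) throws Exception {
        byte[] record = new byte[1000];   // stand-in for a serialized record
        byte[] out = new byte[1000];

        // 1. Heap ByteBuffer: backed by a byte[] inside the JVM heap.
        ByteBuffer heap = ByteBuffer.allocate(4096);
        heap.put(record);
        heap.flip();
        heap.get(out);

        // 2. Direct ByteBuffer: off-heap memory, managed through the NIO API.
        ByteBuffer direct = ByteBuffer.allocateDirect(4096);
        direct.put(record);
        direct.flip();
        direct.get(out);

        // 3. Unsafe: raw off-heap memory; the application must free it itself.
        Unsafe unsafe = getUnsafe();
        long address = unsafe.allocateMemory(4096);
        try {
            unsafe.copyMemory(record, Unsafe.ARRAY_BYTE_BASE_OFFSET, null, address, record.length);
            unsafe.copyMemory(null, address, out, Unsafe.ARRAY_BYTE_BASE_OFFSET, out.length);
        } finally {
            unsafe.freeMemory(address);   // no GC for this memory
        }
    }

    // Standard reflection hack to obtain the Unsafe singleton.
    static Unsafe getUnsafe() throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        return (Unsafe) f.get(null);
    }
}
```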

I found a detailed performance analysis of all three methods on Alexey's blog. Unsafe read/write performance for byte arrays is about 10-12% better than ByteBuffer or Direct ByteBuffer.

Even though Unsafe memory access is faster, we believe there are a couple of JVM runtime options that should help ByteBuffer performance.

JVM runtime options to consider

-XX:+UseLargePages

Most major operating systems support large memory pages. This setting improves performance for the following reasons:

  1. Increased performance through increased Translation Lookaside Buffer (TLB) hits.
  2. Pages are locked in memory and never swapped out, which guarantees the whole JVM heap stays in RAM. The same guarantee cannot be given for Direct ByteBuffer memory.
  3. Contiguous pages are pre-allocated and cannot be used for anything other than System V shared memory, for example the JVM heap.
  4. Less bookkeeping work for the kernel in that part of virtual memory, due to the larger page size.

Only heap ByteBuffers can take advantage of this JVM setting. It does not help direct ByteBuffers or Unsafe buffers, since they are allocated off-heap, outside the JVM's control.

More information about large memory pages can be found here on the Red Hat site.

Please note that large memory pages need to be enabled on the machine before using this option. See the Oracle site here to understand how to enable large pages on different operating systems, including Linux, Windows and Solaris.

The way to use this option is to set the large-page pool a little bigger than the Java heap size you are planning to use. That way the whole Java heap is pinned in memory and the OS won't page its contents in and out.
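For example, a launch line along these lines (the heap size and jar name are purely illustrative) would pin a 28 GB heap into large pages, assuming the OS has already been configured with a slightly larger huge-page pool:

```
java -Xms28g -Xmx28g -XX:+UseLargePages -jar app.jar
```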

-XX:+UseCompressedOops Java option

This option enables pointer compression on 64-bit JVMs to reduce heap usage: object references are stored as 32-bit compressed oops as long as the heap is smaller than about 32 GB.