Hadoop is an open-source framework for storing and processing big data. It distributes large data sets across clusters of commodity machines and analyzes them in parallel, scaling from a single server to thousands of machines. The Hadoop library is designed to detect and handle failures at the application layer, so it delivers high availability without depending on specialized hardware.
Hadoop is important because:
- It stores and processes huge data sets quickly by working on them in parallel
- It scales out simply by adding more nodes to the cluster
- It provides fault tolerance: when a node fails, its tasks are immediately rerouted to other nodes. This is possible because multiple copies of the data are stored across the cluster.
- It offers the flexibility to store data in any format, whether text, video, or images
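The fault-tolerance point above hinges on replication: because every block of data lives on several nodes, losing a node loses no data, and reads are simply routed to a surviving replica. The following is a toy Python sketch of that idea (the `MiniCluster` class and its node layout are illustrative inventions, not real HDFS code):

```python
import random

# Toy sketch of replication-based fault tolerance, in the spirit of HDFS:
# each block is written to several distinct nodes, so a node failure
# loses no data and reads fall through to a surviving replica.

REPLICATION = 3  # HDFS's default replication factor is also 3

class MiniCluster:
    def __init__(self, node_names):
        # node name -> {block_id: data}
        self.nodes = {name: {} for name in node_names}

    def write_block(self, block_id, data):
        # Place REPLICATION copies on distinct nodes.
        targets = random.sample(list(self.nodes), REPLICATION)
        for node in targets:
            self.nodes[node][block_id] = data

    def fail_node(self, name):
        del self.nodes[name]

    def read_block(self, block_id):
        # Route the read to any surviving node that holds a replica.
        for blocks in self.nodes.values():
            if block_id in blocks:
                return blocks[block_id]
        raise IOError(f"all replicas of {block_id} lost")

cluster = MiniCluster(["n1", "n2", "n3", "n4", "n5"])
cluster.write_block("blk_001", b"big data payload")
cluster.fail_node("n1")  # with 3 replicas on 5 nodes, two failures
cluster.fail_node("n2")  # still leave at least one surviving copy
print(cluster.read_block("blk_001"))
```

With 3 replicas spread over 5 nodes, any 2 node failures still leave at least one copy, which is why the read after two failures always succeeds here.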
The four main components of Hadoop are
- Hadoop Distributed File System (HDFS)
- Yet Another Resource Negotiator (YARN)
- MapReduce
- Hadoop Common (the shared utilities and libraries)
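As a small illustration of how HDFS is configured, the replication factor discussed above is set through the real `dfs.replication` property in `hdfs-site.xml` (3 is the default; the value shown is just an example):

```xml
<!-- hdfs-site.xml: number of copies HDFS keeps of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```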
HBase is a distributed, column-oriented, horizontally scalable big data store built on top of the Hadoop Distributed File System. It is modeled after Google's Bigtable and written in Java.
Features of HBase
- It offers strong consistency for reads and writes, which means a read operation always returns the most recently written data.
- It scales horizontally. As a table grows beyond what one machine can accommodate, it is automatically sharded and distributed across multiple machines in the cluster.
- It integrates with MapReduce: an HBase table can act as the source or the sink of a MapReduce job.
- It stores sparse data in a fault-tolerant manner.
- It serves the need to read and write data in real time.
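The auto-sharding feature above can be pictured with a toy sketch: a "region" holds a sorted range of row keys, and when it grows past a size threshold it splits at its midpoint key into two regions. This Python sketch is purely illustrative (the `Region`, `put`, and `split` names are invented here; real HBase region splitting is far more involved):

```python
# Toy sketch of HBase-style auto-sharding (region splitting).
# A region holds a sorted range of row keys; once it exceeds
# MAX_REGION_SIZE rows, it splits into two halves at its midpoint key.

MAX_REGION_SIZE = 4  # rows per region before a split (tiny, for illustration)

class Region:
    def __init__(self, rows=None):
        self.rows = dict(sorted((rows or {}).items()))  # row key -> value

def put(regions, key, value):
    # Regions are kept in key order; pick the last region whose
    # smallest key is <= the new key (or the first region if none match).
    region = next((r for r in reversed(regions)
                   if r.rows and min(r.rows) <= key), regions[0])
    region.rows[key] = value
    if len(region.rows) > MAX_REGION_SIZE:
        split(regions, region)

def split(regions, region):
    keys = sorted(region.rows)
    mid = len(keys) // 2
    left = Region({k: region.rows[k] for k in keys[:mid]})
    right = Region({k: region.rows[k] for k in keys[mid:]})
    idx = regions.index(region)
    regions[idx:idx + 1] = [left, right]  # replace one region with two halves

regions = [Region()]
for i in range(10):
    put(regions, f"row{i:02d}", i)
print(len(regions))  # the table has been split across several regions
```

In real HBase the split regions are then reassigned across region servers, which is what distributes a growing table over the cluster.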
HBase also provides features such as
- Bloom filters
- In-memory operations
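A Bloom filter is a bit array plus several hash functions: membership tests can give false positives but never false negatives, which is what lets HBase skip store files that definitely do not contain a requested row. Here is a minimal Python sketch of the idea (illustrative only, not HBase's actual implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: bit array + k hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive k bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False => definitely absent; True => probably present.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("row-42")
print(bf.might_contain("row-42"))    # True: every bit for this key is set
print(bf.might_contain("row-9999"))  # almost certainly False
```

The "definitely absent" answer is the useful one: a read for a missing row can skip an entire file on disk without touching it.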
Differences between the Hadoop Distributed File System and HBase

| HDFS | HBase |
| --- | --- |
| Java-based distributed file system | Java-based NoSQL database built on top of HDFS |
| Has a static architecture | Allows dynamic changes; can even be used by standalone applications |
| Preferred for offline batch processing | Preferred for real-time processing |
| High latency for operations | Low latency access to small amounts of data |
| Ideally suited for write-once, read-many sequential access | Suited for random writes and reads of data |
| Fully fault tolerant | Partially fault tolerant |
| Accessed through MapReduce jobs | Accessed through the Java API and REST, Avro, and Thrift APIs |
| Data stored in chunks (blocks) | Data stored as key-value pairs |
| Inexpensive when massive amounts of data are being processed | Used specifically for random data access |
| Hive performance over HDFS is excellent | Hive performance over HBase is four to five times slower |
| Maximum data size is 30+ petabytes | Maximum data size is nearly 1 petabyte |