A New Breed of Data Architecture

Apache Hadoop cleverly combines the benefits of cost and convenience, and this has helped push the Hadoop brand further up the ladder of success. Hadoop is neither an alternative to traditional data systems such as RDBMSs nor a substitute for a database; rather, it is best considered a complement to an RDBMS. Hadoop interoperates smoothly with existing tools and works well alongside an RDBMS. There are no relational tables in Hadoop; instead, the Apache Hadoop MapReduce framework uses key/value pairs as its basic data unit. The fundamental tenet of a traditional relational database is a structure defined by a schema, and scaling a commercial relational database is both costly and restricted. Nowadays, digital processes produce roughly 20% structured data (tables) and 80% unstructured data (images, binary and sequential files, logs, XML, etc.). Hadoop provides an elegant approach to running a business well and handling large volumes of data sets, whether structured (Apache HBase), semi-structured (sequential/binary files), or unstructured (text). Hadoop is being used to build analytic applications in a feasible manner, and it is more flexible for managing and working with less-structured data types. A Hadoop setup can also be equipped with a high-level declarative language similar to SQL, with the underlying query engine treated as a black box.
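To make the key/value data unit concrete, here is a minimal sketch in the shape of Hadoop's classic word-count job. The class names are illustrative, and a complete job would also need a driver that configures the input and output paths; the sketch only shows how map emits key/value pairs and how reduce consumes all values grouped under one key.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map: (byte offset, line of text) -> (word, 1)
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);   // emit one key/value pair per word
                    }
                }
            }
        }

        // Reduce: (word, [1, 1, ...]) -> (word, count)
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }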

Mapping RDBMS vs. Hadoop

RDBMS (traditional)                        Hadoop (newer)
----------------------------------------   ----------------------------------------------------
Static schema                              Dynamic schema
Read and write many times                  Write once, read many times
Handles gigabytes of data                  Handles petabytes of data
Batch and interactive data access          Batch, interactive, streaming, and real-time access
High integrity                             Low integrity
Uneven (nonlinear) scaling                 Even (linear) scaling
Cost per user TB: $100 to $100K            Effective cost per user TB: about $250

Hadoop: Scale-out, Shared-nothing Architecture

Coordinating a large-scale distributed cluster is a significant challenge. By splitting the work, the Hadoop Distributed File System (HDFS) and MapReduce (MR) make tasks easier to accomplish in a more cost- and time-effective manner. HDFS stores the data, and adding more resources simply means adding more machines to the Hadoop cluster; MR can then analyze the data and aggregate the results. Data access is quickened by collocating the data with the compute machines, and MR exploits this data locality for faster processing. Map and Reduce processes have no dependence on one another. When many pieces of hardware are in use, there is a high chance that at least one will fail. The MapReduce design reduces the programmer's effort in dealing with such failures: the Hadoop architecture identifies faulty areas and reschedules work on healthy replacements, so the framework gracefully handles both partial and major failures. Replication is a common way to prevent service interruption and data loss; redundant copies of the data are kept across different systems, so in the event of a failure another copy is always available. Thus, network, node, and disk failures are all handled efficiently by the Hadoop framework.
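As an illustration of how replication and data locality surface in the HDFS API, the following sketch assumes a reachable HDFS cluster whose address is picked up from the standard configuration files; the file path is hypothetical. It sets the replication factor of a file and then lists the hosts holding each block's replicas, which is the same locality information the MapReduce scheduler uses to place map tasks on or near the data.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/demo/sample.txt");   // hypothetical path on HDFS
            // Ask HDFS to keep 3 redundant copies of every block of this file,
            // so a node or disk failure never makes the data unavailable.
            fs.setReplication(file, (short) 3);

            // Inspect where each block's replicas live; MapReduce consults this
            // information to schedule map tasks close to the data.
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("Block hosts: " + String.join(", ", block.getHosts()));
            }
            fs.close();
        }
    }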