Hadoop Architecture Features and Summary

Apache Hadoop empowers businesses to promptly gain insight from huge quantities of structured and unstructured data. Apache Hadoop is a layered architecture for storing and analyzing massive amounts of data.

Hadoop comprises two independent frameworks and incorporates a master/slave architecture for both storage and processing.

HADOOP = HDFS (storage) + MR (processing)

1. HDFS: Hadoop Distributed File System

The Hadoop Distributed File System provides the storage layer. HDFS is a reliable distributed file system designed for high-throughput access to data.

2. MR: MapReduce

MapReduce provides the processing layer. MapReduce (MR) is a framework that uses the divide-and-aggregate programming model to achieve high performance in distributed data processing.
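
As a minimal sketch of this divide-and-aggregate model, the canonical word-count job below is written against the Hadoop MapReduce Java API: the map phase divides the input into independently tokenized splits, and the reduce phase aggregates the counts for each word. Input and output paths are supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // "Divide": each mapper tokenizes its own input split independently.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // "Aggregate": the reducer sums the counts emitted for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }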

Data files can be accessed through tools such as Apache Hive and Apache Pig. Apache YARN (Yet Another Resource Negotiator) is the key feature of the second generation of Apache Hadoop (version 2). Initially, YARN was designed for resource management; later it evolved into a distributed data operating system for a variety of Big Data operations. YARN is a cluster management framework and data operating system that helps process, analyze, and manipulate data stored in HDFS. Analytics such as data filtering, comparison, aggregation, and sampling can be carried out on this data. The derived data gives many organizations visibility into various aspects of their business: they can track changes in customer sentiment toward their products, get hold of that feedback in a timely fashion, and respond promptly as needs emerge.

Apache Hadoop is employed in almost every domain. Major web and social networking companies such as Facebook, Amazon, Yahoo, Twitter, LinkedIn, and Google commonly use Hadoop for search; website click analysis; social network and connection analysis; continuous monitoring of social media usage and billing of service usage; digital marketing automation; e-commerce; location-based commerce on mobile devices; storing information from social networks; processing log files created by mobile and web applications; stock market behavior analysis; image and video analysis; predictive modeling for new medications; and fraud detection and prevention.

How to stream data from a drive

A dataset is continuously created or copied from many kinds of sources, and different manipulations and analyses are carried out on it over time. The data needs to be processed sequentially and incrementally, either on a record-by-record basis or over sliding time windows. In earlier transformation methodologies, the data poured out in bursts. When data is processed sequentially on a hard drive, the number of seeks needed to read it can be reduced; seek time can vary depending on system load and network traffic. Data streaming provides a constant bit rate above a certain threshold, and the data is streamed out from the drive while retaining the utmost throughput.
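
As a minimal sketch of sequential, incremental processing (assuming a hypothetical record format of one numeric reading per line on standard input), the following plain Java program maintains a running average over a sliding window of recent records, reading the stream record by record rather than seeking back and forth:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayDeque;
    import java.util.Deque;

    public class SlidingWindowAverage {
        public static void main(String[] args) throws IOException {
            final int WINDOW = 100;                 // hypothetical window size (records)
            Deque<Double> window = new ArrayDeque<>();
            double sum = 0.0;

            // Read the stream sequentially, one record (line) at a time;
            // sequential reads keep disk seeks to a minimum.
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                double value = Double.parseDouble(line.trim());
                window.addLast(value);
                sum += value;
                if (window.size() > WINDOW) {
                    sum -= window.removeFirst();    // slide the window forward
                }
                System.out.println("running average: " + (sum / window.size()));
            }
        }
    }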

Most powerful data processing pattern

Files are expected to be large, and their sizes typically vary widely (KB, MB, GB, TB, and beyond). The Apache Hadoop Distributed File System (HDFS) is a storage system designed to split data into several blocks and store them for later use. Apache HDFS is engineered around the notion that the most efficient data processing pattern is write-once, read-many-times.
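
A minimal sketch of the write-once, read-many pattern using the HDFS Java FileSystem API; the file path is hypothetical, and the Configuration object is assumed to pick up the cluster settings from core-site.xml and hdfs-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteOnceReadMany {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/data/example.txt");  // hypothetical path

            // Write once: HDFS splits the file into blocks and replicates them.
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("written once");
            }

            // Read many times: subsequent analyses stream the same blocks.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }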

Streaming data access in HDFS

To perform a file operation such as a read, the application needs the list of blocks to be read sequentially. The Apache Hadoop Distributed File System is a user-space file system: all its processes run outside the kernel, which allows any user to create and execute applications using the available system resources. Non-privileged users can build their own file systems without modifying kernel code and get faster turnaround; there is also no possibility of a kernel panic, and developers can quickly test an application that triggered a crash. The testing burden of a user-space file system is therefore greatly reduced.

There is a distinct central NameNode that holds an in-memory directory of the locations of all blocks and their replicated copies across the cluster nodes. Block locality details are made visible to the running process. Each analysis uses either a partial or the full dataset, and the time to read the entire dataset matters more than the latency of the first record read. Choosing the Apache Hadoop Distributed File System for data streaming therefore ensures effective data access, with only a little overhead needed to cache input for a constant stream.
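
As a sketch of how the NameNode's in-memory block map is exposed to applications, the HDFS Java API can report which hosts hold each block of a file (the path below is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/example.txt");      // hypothetical path
            FileStatus status = fs.getFileStatus(file);

            // The NameNode answers from its in-memory map of blocks to DataNodes.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + ", length " + block.getLength()
                        + ", hosts " + String.join(",", block.getHosts()));
            }
        }
    }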

Advantage of Hadoop Pipes over Streaming

Apache Hadoop Streaming uses UNIX standard streams as the interface between the Hadoop framework and the application, so any programming language can be used to read and write data via standard input and output. Streaming uses standard I/O to communicate with the MapReduce (MR) code. Hadoop also provides the MapReduce Application Programming Interface (API) for programming map and reduce methods. Apache Hadoop Pipes is the C++ interface to Hadoop MapReduce; Pipes employs a socket as the channel over which the tasktracker communicates with the application in which the C++ map and reduce methods reside. Apache Hadoop Pipes has less overhead than Apache Hadoop Streaming.
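
Under Hadoop Streaming, a mapper is simply an executable that reads records from standard input and writes tab-separated key/value pairs to standard output. A minimal word-count mapper sketch in Java follows; it could equally be written in any language that can read and write standard streams:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class StreamingWordCountMapper {
        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                // Emit one tab-separated key/value pair per word on stdout;
                // the Streaming framework shuffles these pairs to the reducer.
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        System.out.println(word + "\t1");
                    }
                }
            }
        }
    }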

Why choose heterogeneous storage media in Hadoop?

Early in Apache Hadoop's development, a homogeneous storage model was used for data storage. Heterogeneous storage media were later introduced deliberately, to optimize data usage and reduce cost according to how frequently data is accessed. In a cluster, each DataNode is set up with a set of data directories. From Hadoop version 2 onward, each data directory can be configured with a preferred storage type, and storage policies are defined and applied when storing directories and files.
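
For example, storage types can be declared per data directory in hdfs-site.xml by tagging each directory with its medium; the mount points below are hypothetical:

    <property>
      <name>dfs.datanode.data.dir</name>
      <value>[DISK]/mnt/disk0/hdfs/data,[SSD]/mnt/ssd0/hdfs/data,[ARCHIVE]/mnt/archive0/hdfs/data</value>
    </property>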

For instance:

Sometimes data is left unused for quite a long period of time; such data can be archived and kept in a repository. In that scenario, the archive storage type is preferable and economical.
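
A minimal sketch of applying the built-in COLD policy (which places all replicas on ARCHIVE media) to a directory of rarely used data via the HDFS Java API; the directory path is hypothetical, and the same effect can be achieved from the command line with the hdfs storagepolicies command:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ArchiveColdData {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path archiveDir = new Path("/data/archive");   // hypothetical directory

            // COLD keeps every replica on ARCHIVE storage: dense, economical
            // media for data that is rarely, if ever, read.
            fs.setStoragePolicy(archiveDir, "COLD");
        }
    }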