The Beauty of Hadoop

Apache Hadoop has deliberately evolved into the go-to framework for extremely large, data-intensive deployments. Apache Hadoop is an open-source framework for storing and analyzing huge data sets on clusters built from commodity hardware. It provides a powerful, unified architecture in which many machines contribute storage (HDFS) and processing power (MapReduce) to the cluster. In other words, it is a distributed framework for processing and storing data, generally on commodity hardware, comprising distributed storage (HDFS, the Hadoop Distributed File System) and distributed processing (MR, MapReduce). Diverse organizations collaborate voluntarily to build Hadoop; no single agenda has been put forth. Allowing competing projects permits evolution and survival of the fittest. Thus, the quality is emergent and continues to evolve.
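
To make the storage side concrete, here is a minimal Java sketch that writes one file into HDFS through the FileSystem API. The NameNode address and file path are illustrative assumptions, not values from the text; the point is that the client sees a single logical file while HDFS spreads its blocks across the cluster.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write a small file into HDFS via the FileSystem API.
// The address hdfs://namenode:9000 and path /demo/hello.txt are assumed
// placeholders for illustration.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
            // The client writes one logical file; HDFS splits it into blocks
            // and distributes the blocks across the cluster's DataNodes.
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```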

Hadoop: powering the future of Big Data

Apache Hadoop is completely open source and written in Java. It can store any kind of data, in structured or unstructured form. Hadoop is designed as a scale-out architecture operating on a cluster of commodity machines. Hadoop was initially used for batch processing; later versions have been enhanced to support real-time applications. The framework provides data locality: the code moves to the data rather than the data to the code. Using MapReduce, the actual steps for processing the data can be specified to produce the desired output. The framework gives applications reliability, cost-effectiveness, easy adoption, and efficient data motion. HDFS handles hardware failure automatically by chopping data into blocks and scattering them across the cluster's slave nodes. Data is replicated: the default replication factor is three, meaning each block is present on three different machines. Faults are detected: a failed task is rerun, and the system heals itself. Hardware failures are common and are handled automatically by the Hadoop framework.
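
As a sketch of the replication behaviour just described, the Java snippet below asks HDFS for a file's replication factor and the hosts holding each of its blocks; with the default factor of three, each block should report three machines. The file path is an assumed placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: inspect how HDFS replicates a file's blocks.
// The path /demo/hello.txt is a placeholder for illustration.
public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/hello.txt");

            FileStatus status = fs.getFileStatus(file);
            // With the default replication factor of 3, each block below
            // should list three hosts.
            System.out.println("Replication factor: " + status.getReplication());

            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("Block hosts: "
                    + String.join(", ", block.getHosts()));
            }
        }
    }
}
```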

Hadoop Environment configuration

How do you configure Hadoop? Hadoop can be set up in three modes (a configuration sketch follows the list):

  1. Standalone: Hadoop applications can be debugged in standalone mode. It uses the local file system for storage, and all Hadoop daemons run inside a single Java process.
  2. Pseudo-distributed: Every Hadoop daemon runs in a distinct Java Virtual Machine, as a separate process on a single machine.
  3. Fully distributed: The daemons run across a cluster of machines. Parallel processing, data consistency, reliability, scalability, fault tolerance, replication, workflow management, and independence of job execution are all supported in this mode.
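
A minimal sketch of the configuration properties that typically distinguish these modes follows. In a real deployment they are set in core-site.xml and hdfs-site.xml rather than in code, and the host names here are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch of the properties that commonly distinguish the three modes.
// In practice these live in core-site.xml / hdfs-site.xml; setting them
// programmatically here is only for illustration, and the host names
// are assumed.
public class HadoopModes {
    static Configuration standalone() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");          // local file system
        conf.set("mapreduce.framework.name", "local"); // single JVM, no daemons
        return conf;
    }

    static Configuration pseudoDistributed() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // HDFS on one machine
        conf.set("dfs.replication", "1");                  // only one DataNode
        return conf;
    }

    static Configuration fullyDistributed() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:9000"); // assumed host
        conf.set("dfs.replication", "3");                             // default factor
        return conf;
    }
}
```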

Apache Hadoop is built to process a large variety of data, from terabytes to petabytes and beyond. It is difficult to fit that much data on a single computer's hard drive, much less in memory. Apache Hadoop is designed to process extremely large amounts of data by connecting many commodity machines to work in parallel. Distributing the computation resolves the problem of data too big to fit on a single machine. Using the MapReduce design, Hadoop can take a query over a dataset, divide it, and run the sub-tasks in parallel over multiple cluster nodes. In production, all the Hadoop modules are commonly configured in fully distributed mode.
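
To illustrate how MapReduce divides a job into parallel sub-tasks, here is the classic word-count example from the Hadoop MapReduce tutorial, lightly commented. The map phase runs in parallel on the nodes holding the input blocks; the reduce phase aggregates the per-word counts. Input and output paths are supplied as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word-count job: map tasks run in parallel on the nodes that
// hold the input blocks (data locality); reduce tasks sum the counts.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1) per token
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```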