Michael E. Byczek
Software Engineer


Apache Hadoop Framework

Apache Hadoop is an open source framework for distributed processing of large data sets, both structured and unstructured. Files are split into blocks and distributed across the nodes of a cluster that can scale from one to thousands of computers.
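As a sketch of how an application hands data to the cluster, the snippet below writes a file through the HDFS FileSystem Java API; HDFS then splits the file into blocks and replicates them across the cluster's DataNodes. The cluster address, file path, and sample record are hypothetical placeholders; a real deployment would take the address and block size from its site configuration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            // Normally loaded from core-site.xml / hdfs-site.xml on the classpath
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; replace with your cluster's URI
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
            // Block size is a per-file property; 128 MB is the common default
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/data/temperatures/2024.txt");

            // The client streams the bytes; HDFS handles splitting the file
            // into blocks and replicating each block across DataNodes
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeBytes("Chicago,95\n");
            }
            fs.close();
        }
    }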

Most of an average company's data is unstructured (e.g. emails and social media posts) and doesn't fit neatly into rows and columns. Hadoop can store and process data in any file format.

At the core of the framework is MapReduce, a programming model that splits a computation into parallel "map" tasks whose intermediate results are combined by "reduce" tasks.
Example: multiple files contain high temperatures for the same cities on different days. A separate "map" job calculates the highest temperature per city in each file. Those individual results are fed into the "reduce" phase, which combines all the high temperatures to arrive at a single high temperature for each city.
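The sketch below implements that example with Hadoop's Java MapReduce API. The input format (comma-separated "city,temperature" lines) and the command-line input/output paths are assumptions for illustration, not part of the original description.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MaxTemperature {

        // Map: parse assumed "city,temperature" lines, emit (city, temperature)
        public static class TempMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] parts = value.toString().split(",");
                context.write(new Text(parts[0]),
                              new IntWritable(Integer.parseInt(parts[1].trim())));
            }
        }

        // Reduce: for each city, keep the maximum of all observed temperatures
        public static class MaxReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                                  Context context)
                    throws IOException, InterruptedException {
                int max = Integer.MIN_VALUE;
                for (IntWritable v : values) {
                    max = Math.max(max, v.get());
                }
                context.write(key, new IntWritable(max));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "max temperature");
            job.setJarByClass(MaxTemperature.class);
            job.setMapperClass(TempMapper.class);
            // Max is associative, so the reducer can also run as a combiner
            job.setCombinerClass(MaxReducer.class);
            job.setReducerClass(MaxReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }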

Hadoop also includes a larger collection of services and related tools, such as:
Spark - a general-purpose engine that processes data in memory (see the sketch after this list)
Pig - a high-level scripting language (Pig Latin) for expressing data flows
Hive - a data warehouse layer that offers SQL-like queries (HiveQL)
Mahout - a library of scalable machine learning algorithms
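For comparison, the same per-city maximum can be expressed in Spark's Java API. This is a minimal sketch that assumes the same comma-separated input as the MapReduce example above; because taking a maximum is associative, Math::max serves directly as the reduce function.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkMaxTemperature {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("max temperature");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Assumes the same "city,temperature" lines as the MapReduce sketch
                JavaPairRDD<String, Integer> maxByCity = sc
                    .textFile(args[0])
                    .mapToPair(line -> {
                        String[] parts = line.split(",");
                        return new Tuple2<>(parts[0],
                                            Integer.parseInt(parts[1].trim()));
                    })
                    .reduceByKey(Math::max);  // keep the larger value per city
                maxByCity.saveAsTextFile(args[1]);
            }
        }
    }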
