Advanced Hadoop

This module covers topics that developers commonly encounter when applying Hadoop to larger-scale, more complex real-world data processing problems.

We’ll cover extending Hadoop via custom input/output formats, implementing the most common Big Data processing patterns in Hadoop, monitoring and tuning cluster performance, and best practices for testing Hadoop workflows.

Who Should Attend?

Java developers who want to learn more about how best to use Hadoop to solve real-world data processing problems.

Prerequisites

This course assumes basic knowledge of Hadoop. We recommend completing our Introduction to Hadoop course or having equivalent hands-on experience. Relevant work experience is also highly valuable, as students who arrive with real-world problems in hand will benefit from the instructor’s input on their specific issues.

Participants should be comfortable reading and writing Java code; familiarity with Bash will help.

Outline

  • Extending Hadoop – Custom partitioning, comparators, and input/output formats
  • Extending Hadoop Lab – Reading & processing a custom data format
  • Common Patterns – Filtering, sorting, binning, and joining data sets
  • Common Patterns Lab – Implement a workflow that filters, sorts, and joins data
  • Monitoring – Best practices for making sure your jobs are running properly
  • Testing – How to write unit & integration tests to validate Hadoop workflows
  • Optimizations – Common causes of performance problems and how to fix them
  • Optimizations Lab – Dramatically improve the performance of a typical workflow
  • Summary