Advanced Hadoop
This course covers extending Hadoop with custom input/output formats, implementing the most common Big Data processing patterns in Hadoop, monitoring and tuning cluster performance, and best practices for testing Hadoop workflows.
Who Should Attend?
Java developers who want to learn more about how best to use Hadoop to solve real-world data processing problems.
Prerequisites
This course assumes basic knowledge of Hadoop. We recommend completing our Introduction to Hadoop course or having equivalent hands-on experience. Relevant work experience is also highly valuable: students who arrive with real-world problems in hand will benefit from the instructor’s input on their specific issues.
Participants should be comfortable reading and writing Java code; familiarity with Bash will help.
Outline
- Extending Hadoop – Custom partitioning, comparators, and input/output formats
- Extending Hadoop Lab – Reading & processing a custom data format
- Common Patterns – Filtering, sorting, binning, and joining data sets
- Common Patterns Lab – Implement a workflow that filters, sorts, and joins data
- Monitoring – Best practices for making sure your jobs are running properly
- Testing – How to write unit & integration tests to validate Hadoop workflows
- Optimization – Common causes of performance problems and how to fix them
- Optimization Lab – Dramatically improve the performance of a typical workflow
- Summary