Amazon Elastic MapReduce

This module takes Hadoop developers through the ins and outs of leveraging Amazon’s Elastic MapReduce (EMR) service to process big data quickly, at lower cost and with less hassle.

Learn from a true expert – instructor Ken Krugler is an active architect and developer, and the author of the Elastic MapReduce training videos found on Amazon’s Elastic MapReduce Training page.

Who Should Attend?

Hadoop developers who want to learn how best to use Amazon’s Elastic MapReduce service.

Prerequisites

This module assumes basic knowledge of Hadoop. We recommend completing our Introduction to Hadoop module, or equivalent hands-on experience. Relevant work experience is also highly valuable, as students who arrive with real-world problems in hand will benefit from the instructor’s input on their specific issues.

Participants should be comfortable reading and writing Java code; familiarity with Bash will help.

Outline

  • Getting Started – Signing up for an AWS account, generating a key-pair, and setting up an S3 bucket
  • Running Jobs – Creating, monitoring, and getting results from your EMR Job Flow
  • Clusters of Servers – EC2 instance types, pricing, and Hadoop cluster configuration
  • Dealing with Data – S3 architecture, pricing, and access control
  • Map-Reduce Lab – How to use a Hadoop Job Flow to analyze text from Wikipedia
  • Command Line Tools – When and how to use the EMR and s3cmd tools
  • Debugging Tools – Best practices for debugging EMR Job Flows
  • Hive & Pig – Creating, monitoring, and getting results from Hive & Pig Job Flows
  • Hive Lab – How to use a Hive Job Flow to analyze Wikipedia article data
  • Advanced Elastic MapReduce – Bootstrap actions, spot pricing, and task groups
  • Summary
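
To give a flavor of the Map-Reduce Lab above, here is a rough sketch of the word-count pattern that a Hadoop Streaming job follows when analyzing text such as Wikipedia articles. This is illustrative only, not the course’s actual lab code: the `mapper` and `reducer` function names and the sample input are our own, and a real Streaming Job Flow would run these as separate stdin/stdout scripts on the cluster.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word. Input must be sorted
    by key, which Hadoop's shuffle/sort phase guarantees between phases."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local simulation of the map -> shuffle (sort) -> reduce pipeline,
    # using a tiny sample in place of a Wikipedia dump on S3.
    sample = ["Hadoop runs on EMR", "EMR runs Hadoop"]
    mapped = sorted(mapper(sample))
    for word, total in reducer(mapped):
        print("%s\t%d" % (word, total))
```

On EMR the same logic would be packaged as mapper and reducer scripts for a Streaming Job Flow, with input read from and output written to an S3 bucket rather than local lists.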