Advanced Cascading

This module covers topics that developers commonly encounter when applying Cascading to larger, more complex real-world data processing problems.

We’ll cover error handling with Traps, optimizing Flows, creating custom operations, best practices, effective use of SubAssemblies, and monitoring Flows.

Who Should Attend?

Java developers who want to learn more about how best to use Cascading to solve real-world data processing problems.

Prerequisites

This module assumes basic knowledge of Hadoop and Cascading. We recommend completing both our Introduction to Hadoop and Introduction to Cascading modules first. Relevant work experience is also valuable: students who arrive with real-world problems in hand will benefit from the instructor’s input on their specific issues.

Participants should be comfortable reading and writing Java code; familiarity with Bash will help.

Outline – Half Day

  • Custom Operations – Creating your own Functions, Filters and Buffers
  • Hands-on Lab #1 – Extending Cascading
  • Optimizing Workflows – Common Techniques
  • Hands-on Lab #2 – Optimizations
  • SubAssemblies & Cascades – Reusable Components, Reliable Workflows
  • Failure Traps – How to Handle Bad Data
  • Debugging & Monitoring Workflows – Best Practices
  • Hands-on Lab #3 – Trapping Bad Data, Modularizing a Workflow
  • Summary

Outline – Full Day

  • Custom Operations – Creating your own Functions, Filters and Buffers
  • Custom Types – Beyond Primitive Types in Tuples
  • Hands-on Lab #1 – Extending Cascading
  • Hadoop Integration – Data Interchange, Streaming Jobs
  • Optimizing Workflows – Common Techniques
  • Optimizing Hadoop – Tuning Hadoop Job Settings
  • Hands-on Lab #2 – Optimizations
  • SubAssemblies & Cascades – Reusable Components, Reliable Workflows
  • Failure Traps – How to Handle Bad Data
  • Debugging & Monitoring Workflows – Best Practices
  • Hands-on Lab #3 – Trapping Bad Data, Modularizing a Workflow
  • Summary