Introduction to Cascading

This module teaches you how to use Cascading, an open-source workflow API, to build high-performance, scalable, reliable, and maintainable data processing solutions on top of Hadoop.

We’ll cover modelling problems using Cascading’s workflow graph approach, leveraging built-in operations, extending Cascading with custom operations, simple and complex grouping & joining, input/output using Taps and Schemes, and best practices.

Students will learn how to apply Cascading to a wide range of complex data processing problems.

Who Should Attend?

Hadoop developers who want to learn how to use Cascading to reduce development time (often by more than 75%), improve performance, and simplify complex data processing workflow development.

Prerequisites

To get the most from this module you should have experience with Hadoop. We recommend completing our Introduction to Hadoop module. Relevant work experience is also highly valuable, as students who arrive with real-world problems in hand will benefit from the instructor’s input on their specific issues.

Participants should be comfortable reading and writing Java code; familiarity with Bash will help.

Outline – Half Day

  • Overview – Cascading from 40,000 ft
  • Thinking in Cascading – Pipes, Tuples, Fields and Operations
  • Hands-on Lab #1 – Parsing Log Files & Merging Data Streams
  • Taps & Schemes – Sources and Sinks for Data
  • Operations – Functions, Filters, Aggregators and Buffers
  • Grouping & Joining
  • Real-world Workflow Design
  • Hands-on Lab #2 – Log File Analytics
  • Summary

Outline – Full Day

  • Overview – Cascading from 40,000 ft
  • Benefits – Why Cascading?
  • Thinking in Cascading – Pipes, Tuples, Fields and Operations
  • Pipes, Tuples & Fields in Depth
  • Hands-on Lab #1 – Parsing Log Files & Merging Data Streams
  • Taps & Schemes – Sources and Sinks for Data
  • Operations – Functions, Filters, Aggregators and Buffers
  • Grouping & Joining
  • Real-world Workflow Design
  • Cascading Local Mode
  • Hands-on Lab #2 – Log File Analytics
  • Test-Driven Development with Cascading
  • Summary