Big Data and Solr

This module shows how to apply the processing power of Hadoop to common data processing challenges encountered while creating Solr indexes.

We’ll look at common use cases for generating search indexes from big data, typical patterns for the data processing workflow, and how to make it all work reliably at scale.

We will explore in detail an example of processing web crawl results to create a faceted Solr search solution.
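To make the faceted-search goal concrete, here is a minimal SolrJ sketch of the kind of query the example solution supports. The Solr URL, collection name (webcrawl), and facet field names (domain, language) are hypothetical stand-ins, not part of the course material:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetedSearchExample {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/webcrawl").build()) {
            SolrQuery query = new SolrQuery("body:hadoop");
            query.setFacet(true);
            query.addFacetField("domain", "language");  // hypothetical facet fields
            query.setFacetMinCount(1);
            query.setRows(10);

            QueryResponse response = solr.query(query);
            for (FacetField facet : response.getFacetFields()) {
                System.out.println(facet.getName() + ":");
                facet.getValues().forEach(count ->
                        System.out.println("  " + count.getName() + " (" + count.getCount() + ")"));
            }
        }
    }
}
```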

You’ll also learn how Solr can be used as a NoSQL solution and how it compares to classic NoSQL projects such as Cassandra and HBase.
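As a taste of the NoSQL angle, the sketch below uses Solr as a simple key-value store via SolrJ: a document is "put" under its unique id and fetched back with getById, which goes through Solr's real-time get handler (assuming the collection's update log is enabled). The collection name and fields are hypothetical:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrAsKeyValueStore {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/users").build()) {
            // "Put": index a document keyed by its unique id.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "user-42");
            doc.addField("name", "Ada");
            solr.add(doc);
            solr.commit();

            // "Get": fetch by key via the real-time get handler.
            SolrDocument fetched = solr.getById("user-42");
            System.out.println(fetched.getFieldValue("name"));
        }
    }
}
```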

Who Should Attend?

Developers who need to process data at scale, where the end result is an index of the data suitable for search and/or data analytics.

Prerequisites

To get the most from this course you should have experience with Java, Hadoop, and developing Solr applications. We recommend completing both our Introduction to Hadoop module and LucidWorks’ “Developing Search Applications with Solr” course. Relevant work experience is also highly valuable, as students who arrive with real-world problems in hand will benefit from the instructor’s input on their specific issues.

Participants should be comfortable reading and writing Java code; familiarity with Bash will help.

Outline

  • Overview – Generating Solr Indexes with Hadoop
  • Workflows – Connecting Big Data to Solr
  • Indexing – How to Quickly Build Big Indexes
  • Hands-on Lab – Generating a Word Co-occurrence Index (see the code sketch after this outline)
  • Data Analysis – Preparing Data for Solr
  • NoSQL – Using Solr as a Scalable Database
  • Big Data Example – 1 Billion Records in Solr
  • Summary
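
As a preview of the hands-on lab, here is a minimal Hadoop MapReduce sketch of word co-occurrence counting: the mapper emits each adjacent-word pair as a key, and the reducer sums the counts. The class names and the simple whitespace/punctuation tokenization are illustrative choices, not the lab's actual code:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CooccurrenceMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Naive tokenization: lowercase the line and split on non-word characters.
        String[] words = value.toString().toLowerCase().split("\\W+");
        for (int i = 0; i + 1 < words.length; i++) {
            if (words[i].isEmpty() || words[i + 1].isEmpty()) continue;
            pair.set(words[i] + "\t" + words[i + 1]);  // adjacent-word pair as the key
            context.write(pair, ONE);
        }
    }
}

class CooccurrenceReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(key, new IntWritable(sum));  // pair -> co-occurrence count
    }
}
```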