Spark in Scala
December 21, 2022
This course takes advantage of the Scala programming language to make the most of the Spark framework. It offers a deep dive into distributed programming with Apache Spark and Scala, explaining the nuts and bolts of the Spark computational model and the intricacies of the Spark APIs. In particular, the course shows how to analyze program performance using the Spark UI and how to solve common optimization problems through practical exercises.
Audience:
- Non-Scala programmers who want to jump on the Scala bandwagon and make the most of the Spark framework through Scala
- Spark programmers in Java or Python who want to start using the framework from Scala
- Big data programmers in Spark interested in improving their skills in the framework and gaining a full understanding of performance problems and their solutions
Course Topics
- Topics: Spark, Catalyst, DataFrames, Scala, Optimization
- References: https://www.oreilly.com/library/view/spark-the-definitive/9781491912201/
What students will learn
- Feel comfortable using Spark with Scala
- Get acquainted with the features of the Scala API of Spark
- Understand Scala signatures made up of generics, implicits, etc. (see the sketch after this list)
- Get used to the strongly typed discipline of Scala
- Learn the basic FP techniques needed to develop efficient and modular ETLs in Spark
- Identify and resolve common problems in Spark, with a special focus on performance
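As a taste of the kind of signature students will learn to read, here is a minimal sketch modelled on the shape of Spark's `Dataset.map`; the trait name `MyDataset` is hypothetical, while `Encoder` is Spark's real serialization type class.

```scala
import org.apache.spark.sql.Encoder

// A hypothetical, simplified trait in the style of Spark's Dataset API.
// `T` and `U` are generic type parameters; the implicit Encoder[U]
// tells Spark how to serialize values of the result type U.
trait MyDataset[T] {
  def map[U](f: T => U)(implicit enc: Encoder[U]): MyDataset[U]
}
```

Reading a signature like this means seeing both the generics (`T`, `U`) and the implicit evidence the compiler must supply.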
Module 1: Spark features I: the computational model (4 hours)
The basic principles and concepts behind Spark as a framework for distributed processing.
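As a first taste of that model (a minimal sketch, assuming a local SparkSession), transformations are lazy and only an action triggers a distributed job:

```scala
import org.apache.spark.sql.SparkSession

object LazyModelDemo {
  def main(args: Array[String]): Unit = {
    // Local master for illustration; on a real cluster this would be
    // the resource manager's URL instead.
    val spark = SparkSession.builder().master("local[*]").appName("lazy-demo").getOrCreate()

    val numbers = spark.sparkContext.parallelize(1 to 1000000)
    // filter is a transformation: it only builds the execution plan.
    val evens = numbers.filter(_ % 2 == 0)
    // count is an action: this is where the distributed job actually runs.
    println(evens.count())

    spark.stop()
  }
}
```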
Module 2: Spark features II: Spark APIs (4 hours)
How do we process massive distributed data sets in a cluster? With high-level APIs! Spark puts two major alternatives at your fingertips: a statically typed API (Datasets) and a dynamically typed one (DataFrames).
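A minimal sketch of the contrast, assuming a local SparkSession and a toy `Person` case class:

```scala
import org.apache.spark.sql.SparkSession

final case class Person(name: String, age: Int)

object TwoApisDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("two-apis").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ana", 34), Person("Luis", 17)).toDS()

    // Statically typed (Dataset): field access is checked at compile time.
    val adultsTyped = people.filter(p => p.age >= 18)

    // Dynamically typed (DataFrame): columns are resolved at runtime,
    // so a misspelled column name only fails when the query executes.
    val adultsUntyped = people.toDF().filter($"age" >= 18)

    adultsTyped.show()
    adultsUntyped.show()
    spark.stop()
  }
}
```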
Module 3: Spark features III: Reading and writing in Spark (4 hours)
The previous module focused on transformations; this one focuses on the data side: formats, optimizations, management, etc.
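For instance, a minimal sketch (with placeholder paths) of reading CSV and writing a columnar format such as Parquet:

```scala
import org.apache.spark.sql.SparkSession

object ReadWriteDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("read-write").getOrCreate()

    // Placeholder input path; header and schema inference are options
    // of Spark's built-in CSV source.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/input.csv")

    // Columnar formats such as Parquet store the schema with the data
    // and enable optimizations like column pruning and predicate pushdown.
    df.write.mode("overwrite").parquet("data/output.parquet")

    spark.stop()
  }
}
```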
Module 4: Spark optimizations (4 hours)
Learn how to make the most of Spark's built-in optimizations for free.
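One source of those free optimizations is the Catalyst optimizer listed in the course topics; here is a minimal sketch (with a placeholder Parquet path) of inspecting what Catalyst did to a query:

```scala
import org.apache.spark.sql.SparkSession

object CatalystDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("catalyst").getOrCreate()
    import spark.implicits._

    // Placeholder path: any Parquet data set with "name" and "age" columns.
    val people = spark.read.parquet("data/people.parquet")
    val adults = people.filter($"age" >= 18).select("name")

    // explain(true) prints the parsed, analyzed, optimized and physical
    // plans, showing e.g. the filter pushed down into the Parquet scan.
    adults.explain(true)

    spark.stop()
  }
}
```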
Module 5: Best practices on performance & modular design (4 hours)
Learn the best ways to optimize and organize your Spark code to make it more robust and performant.
- Partitioning issues: unpartitioned data and over-partitioning (see the sketch after this list)
- Fixing memory problems
- Solving serialization issues
- Caching: when it improves your process, and when it is just extra work
- Tasks that never finish: detecting why this happens
- Workflow structure: design patterns to properly modularize your ETLs and improve testability
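As a preview of the module, a minimal sketch (with placeholder paths, column names, and partition counts) of the partitioning and caching trade-offs listed above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PerfBasicsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("perf").getOrCreate()

    // Placeholder path; 200 is an illustrative partition count that must
    // be tuned to the data volume and cluster size to avoid both
    // unpartitioned data and over-partitioning.
    val events = spark.read.parquet("data/events.parquet")
    val repartitioned = events.repartition(200)

    // Caching pays off only when the data set is reused; if it is read
    // once, materializing it is just extra work.
    repartitioned.persist(StorageLevel.MEMORY_AND_DISK)
    println(repartitioned.count())                     // materializes the cache
    println(repartitioned.filter("value > 0").count()) // reuses it ("value" is a placeholder column)

    repartitioned.unpersist()
    spark.stop()
  }
}
```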