Spark Scala Kubernetes (Piloting)
Description
This course covers Apache Spark with Scala, from fundamentals to deployment. It starts with Spark's architecture and its advantages over Hadoop MapReduce, then moves to containerization, with a focus on deploying Spark applications on Kubernetes. You will work with the high-level Data APIs and learn to read from and write to external storage, then continue with the Spark DSL, Spark SQL, and optimization techniques. The course closes with testing strategies for Spark applications and a deep dive into Structured Streaming. It combines theory with hands-on practice.
A certificate is issued.
Objectives
- Foundational Spark Principles: Covers Spark's core concepts and architecture, compares its efficiency with Hadoop MapReduce, and surveys the supported resource managers.
- Spark & Kubernetes Synergy: Covers containerizing Spark applications, Kubernetes fundamentals, and efficient deployment techniques.
- Data API Proficiency: Covers Spark's high-level Data APIs, DataFrame and Dataset, including their differences, parallelization, and storage options.
- External Data Management Mastery: Covers techniques for exchanging data with external storage systems, choosing data formats, and transferring data efficiently.
- Spark Optimization & Streamlining: Addresses common performance problems in Spark and the strategies for resolving them, and introduces Structured Streaming techniques and applications.
Target Audience
Developers, architects
Prerequisites
Basic Java, Scala programming skills. Unix/Linux shell familiarity. Experience with databases (Kafka is optional).
Roadmap
- Module 1: Spark concepts and architecture (theory 2h 30m, practice 1h 30m)
- Module 2: Containerization and deploying Spark applications to Kubernetes (theory 1h, practice 1h)
- Module 3: High-Level Data API: DataFrame, Dataset (theory 2h, practice 2h)
- Module 4: Loading data from/to external storage (theory 1h, practice 3h)
- Module 5: Spark DSL and SQL (theory 2h, practice 1h)
- Module 6: Spark optimization cases (theory 2h, practice 1h)
- Module 7: Testing Spark Applications (theory 2h, practice 1h)
- Module 8: Spark Structured Streaming (theory 2h, practice 1h)
Spark concepts and architecture
See where Spark outperforms Hadoop MapReduce through hands-on examples. Study the Lambda architecture and the difference between batch and streaming processing. Work with Spark's resource managers: Kubernetes, YARN, and Standalone. Learn how to launch Spark applications. Definitions of the key terms are included.
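As a rough illustration of what launching a Spark application involves, the sketch below builds a `SparkSession` against a local master and runs a small word count. The object name `HelloSpark`, the application name, and the sample words are hypothetical. On a real cluster the master URL would normally come from `spark-submit` rather than being hard-coded.

```scala
import org.apache.spark.sql.SparkSession

object HelloSpark {
  // Counts occurrences of each word using the Dataset/DataFrame API.
  def wordCounts(spark: SparkSession, words: Seq[String]): Map[String, Long] = {
    import spark.implicits._
    words.toDS()                 // Dataset[String] with a single column named "value"
      .groupBy("value")
      .count()                   // columns: value (string), count (long)
      .collect()
      .map(row => row.getString(0) -> row.getLong(1))
      .toMap
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs the driver and executors inside one JVM (assumption: local experiment).
    val spark = SparkSession.builder()
      .appName("hello-spark")
      .master("local[*]")
      .getOrCreate()
    try {
      wordCounts(spark, Seq("spark", "scala", "spark")).foreach(println)
    } finally spark.stop()
  }
}
```

Swapping `local[*]` for a YARN, Standalone, or Kubernetes master changes where the executors run, not the application code.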
Containerization and deploying Spark applications to Kubernetes
Learn containerization and the core Kubernetes terminology. Compare Kubernetes with YARN. Understand dynamic resource allocation. Learn to containerize Spark applications, deploy them to Kubernetes, and launch them there.
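One possible way to submit an application to Kubernetes from Scala is the `SparkLauncher` API, sketched below under several assumptions: the API server URL, container image, namespace, service account, and jar path are all placeholders, and `SPARK_HOME` must point to a Spark distribution on the submitting machine. In practice many teams use the `spark-submit` command line with the same settings.

```scala
import org.apache.spark.launcher.SparkLauncher

object K8sSubmit {
  // Spark-on-Kubernetes settings; image, namespace and service account are placeholders.
  def k8sConf(image: String, namespace: String): Map[String, String] = Map(
    "spark.kubernetes.container.image" -> image,
    "spark.kubernetes.namespace" -> namespace,
    "spark.kubernetes.authenticate.driver.serviceAccountName" -> "spark",
    // Dynamic allocation on Kubernetes relies on shuffle tracking,
    // since no external shuffle service is available by default.
    "spark.dynamicAllocation.enabled" -> "true",
    "spark.dynamicAllocation.shuffleTracking.enabled" -> "true"
  )

  def main(args: Array[String]): Unit = {
    val launcher = new SparkLauncher()
      .setMaster("k8s://https://kubernetes.example.com:6443") // hypothetical API server
      .setDeployMode("cluster")                                // driver runs in a pod
      .setAppName("spark-pi")
      .setMainClass("org.apache.spark.examples.SparkPi")
      // Hypothetical path: "local://" means the jar is already inside the image.
      .setAppResource("local:///opt/spark/examples/jars/spark-examples.jar")

    k8sConf("registry.example.com/spark:latest", "spark-jobs")
      .foreach { case (key, value) => launcher.setConf(key, value) }

    val handle = launcher.startApplication()
    println(s"Submitted, initial state: ${handle.getState}")
  }
}
```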
High-Level Data API: DataFrame, Dataset
Explore the high-level Data APIs: DataFrame and Dataset. Understand the differences between RDD, DataFrame, and Dataset. Learn how to create and parallelize them. Analyze DataFrames and Datasets and inspect their execution through query plans and DAGs. Learn how to save results to HDFS, FTP, and S3.
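To make the RDD / DataFrame / Dataset comparison concrete, here is a minimal sketch that computes the same total three ways and prints a query plan. The `Trade` case class and its sample values are hypothetical and exist only for this illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.sum

// Hypothetical record type used only for this illustration.
case class Trade(symbol: String, qty: Int)

object ApiComparison {
  def totals(spark: SparkSession): (Int, Long, Int) = {
    import spark.implicits._
    val data = Seq(Trade("AAA", 10), Trade("BBB", 5), Trade("AAA", 7))

    // RDD: plain JVM objects, no schema, no Catalyst optimization.
    val rdd: RDD[Trade] = spark.sparkContext.parallelize(data, numSlices = 2)
    val viaRdd: Int = rdd.map(_.qty).reduce(_ + _)

    // DataFrame: untyped rows with a schema; column names are checked at runtime.
    val df: DataFrame = data.toDF()
    val viaDf: Long = df.agg(sum("qty")).first().getLong(0) // sum of an int column is a long

    // Dataset: typed objects with a schema; field access is checked at compile time.
    val ds: Dataset[Trade] = data.toDS()
    val viaDs: Int = ds.map(_.qty).collect().sum

    // Prints the physical plan that Spark generated for the aggregation.
    df.groupBy("symbol").agg(sum("qty")).explain()

    (viaRdd, viaDf, viaDs)
  }
}
```

The DataFrame and Dataset variants go through the Catalyst optimizer, which is why their plans can be inspected with `explain()`, while the RDD variant runs exactly as written.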
Loading data from/to external storage
Learn techniques for loading data from and into external storage: read from and write to HDFS, S3, FTP, and the local file system. Choose suitable data formats. Learn parallelized JDBC reads and writes. Create DataFrames and Datasets from Kafka topics. Load data into Cassandra efficiently.
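A parallelized JDBC read can be sketched as follows. The helper names, the connection URL, the table, and the partition column in the usage note are assumptions made for illustration; the option keys themselves are the standard Spark JDBC data source options.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JdbcParallelRead {
  // With these options Spark issues `partitions` range queries on `column`
  // instead of pulling the whole table through a single connection.
  def partitionedOptions(url: String, table: String, column: String,
                         lower: Long, upper: Long, partitions: Int): Map[String, String] = Map(
    "url" -> url,
    "dbtable" -> table,
    "partitionColumn" -> column,      // must be a numeric, date, or timestamp column
    "lowerBound" -> lower.toString,   // bounds only define the partition stride,
    "upperBound" -> upper.toString,   // they do not filter rows
    "numPartitions" -> partitions.toString
  )

  // Requires the matching JDBC driver on the classpath.
  def read(spark: SparkSession, options: Map[String, String]): DataFrame =
    spark.read.format("jdbc").options(options).load()
}
```

A call might look like `JdbcParallelRead.read(spark, JdbcParallelRead.partitionedOptions("jdbc:postgresql://db.example.com/shop", "orders", "order_id", 1L, 1000000L, 8))`, where every argument is a placeholder.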
Spark optimization cases
Work through Spark optimization scenarios: resolving out-of-memory errors, handling small files in HDFS, correcting skewed data, speeding up joins, optimizing broadcasts of large tables, sharing resources, and using Adaptive Query Execution (AQE) and Dynamic Partition Pruning (DPP) for performance tuning.
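Two of these levers, the adaptive execution settings and an explicit broadcast join, can be sketched as below. The object and method names are hypothetical; the configuration keys and the `broadcast` function are part of Spark SQL, and several of these settings are already enabled by default in recent Spark 3.x releases.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object TuningSketch {
  // Turns on AQE (with skew-join handling and partition coalescing) and DPP.
  def enableAdaptiveFeatures(spark: SparkSession): Unit = {
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
    spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
  }

  // Hints Spark to ship the small `dims` table to every executor,
  // which avoids shuffling the large `facts` table for the join.
  def joinWithBroadcast(facts: DataFrame, dims: DataFrame, key: String): DataFrame =
    facts.join(broadcast(dims), Seq(key))
}
```

Broadcasting only helps when the broadcast side fits comfortably in executor memory; for a table that is too large, the hint can itself cause out-of-memory errors.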
Testing Spark Applications
Four levels of quality for Spark applications
Unit testing for Spark applications
Problems with unit testing Spark applications
Libraries and Solutions
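One common approach to unit testing, sketched here as an assumption rather than as the course's prescribed method, is to keep transformations as pure `DataFrame => DataFrame` functions and run them against a small local `SparkSession`. The names `Transformations` and `TransformationsCheck` are hypothetical, and plain collection results stand in for what a test framework such as ScalaTest would assert on.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, upper}

object Transformations {
  // Logic under test: free of I/O, so it can run on tiny in-memory data.
  def normalizeNames(df: DataFrame): DataFrame =
    df.withColumn("name", upper(col("name")))
}

object TransformationsCheck {
  def run(): Seq[String] = {
    // Small local session: UI disabled and a single shuffle partition keep tests fast.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("unit-test")
      .config("spark.ui.enabled", "false")
      .config("spark.sql.shuffle.partitions", "1")
      .getOrCreate()
    try {
      import spark.implicits._
      val input = Seq("ada", "Linus").toDF("name")
      Transformations.normalizeNames(input).as[String].collect().toSeq
    } finally spark.stop()
  }
}
```

Starting a session per test is slow, which is one of the typical problems with unit testing Spark code; sharing one session across a test suite is a frequent workaround.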
Spark Structured Streaming
Streaming DataFrames and Datasets
DataFrames and Datasets based on a Kafka topic
Loading data into Cassandra
Working with Spark and Cassandra state
Optimization features
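The Kafka-to-Cassandra path can be sketched roughly as follows. This assumes the `spark-sql-kafka-0-10` and DataStax `spark-cassandra-connector` packages are on the classpath; the broker address, topic, keyspace, table, and checkpoint path are placeholders, and the target table is assumed to have `id` and `payload` columns.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object KafkaToCassandra {
  // Connector options for the target table (hypothetical keyspace and table).
  def cassandraOptions(keyspace: String, table: String): Map[String, String] =
    Map("keyspace" -> keyspace, "table" -> table)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-cassandra").getOrCreate()

    // Streaming DataFrame backed by a Kafka topic; key and value arrive as binary.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker.example.com:9092")
      .option("subscribe", "events")
      .load()
      .select(col("key").cast("string").as("id"),
              col("value").cast("string").as("payload"))

    // Each micro-batch is written with the ordinary batch writer.
    // The explicit function type avoids overload ambiguity on foreachBatch.
    val writeBatch: (DataFrame, Long) => Unit = (batch, _) =>
      batch.write
        .format("org.apache.spark.sql.cassandra")
        .options(cassandraOptions("demo", "events"))
        .mode("append")
        .save()

    val query = events.writeStream
      .foreachBatch(writeBatch)
      .option("checkpointLocation", "/tmp/checkpoints/kafka-to-cassandra")
      .start()

    query.awaitTermination()
  }
}
```

The checkpoint location is what lets the query resume from the last committed Kafka offsets after a restart.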