Master distributed data processing with Scala and Apache Spark. Learn functional programming techniques for big data analysis and performance optimization.
This comprehensive course teaches distributed big data processing using Scala and Apache Spark. Students learn to manipulate large-scale data using functional programming concepts, focusing on Spark's programming model and distributed collections framework. The curriculum covers essential topics including RDDs, transformations, actions, and performance optimization. Through hands-on programming assignments, participants master data loading, manipulation, and analysis while understanding crucial concepts like shuffling, partitioning, and data locality.
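To give a concrete feel for the programming model described above, here is a minimal sketch of an RDD workflow in Scala. It is illustrative only: the application name, local master setting, and input path are assumptions, not course materials.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // Local SparkContext for experimentation; a real cluster deployment differs.
    val conf = new SparkConf().setAppName("WordCountSketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Load data from persistent storage into an RDD (the path is hypothetical).
    val lines = sc.textFile("data/sample.txt")

    // Transformations are lazy: nothing executes until an action is called.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // An action (take) triggers the distributed computation.
    counts.take(10).foreach(println)

    sc.stop()
  }
}
```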
2,271 already enrolled
Instructor: Heather Miller
English
21 languages available
What you'll learn
Read and load data into Apache Spark from persistent storage
Manipulate large datasets using Spark and Scala
Express data analysis algorithms in functional style
Optimize performance by avoiding shuffles and recomputation (a short sketch follows this list)
Work with Spark SQL and DataFrames
Implement distributed data processing solutions
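In practice, the point about shuffles and recomputation comes down to API choice and caching. The sketch below assumes a pair RDD of (word, count) tuples, such as the one produced earlier; the function names are hypothetical.

```scala
import org.apache.spark.rdd.RDD

// reduceByKey combines values locally within each partition before any data
// moves across the network, so far less is shuffled than with
// groupByKey(...).mapValues(_.sum), which ships every individual pair.
def totalsPerWord(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
  pairs.reduceByKey(_ + _)

// cache() keeps a reused RDD in memory, so repeated actions do not recompute
// its whole lineage from the original input.
def reuse(pairs: RDD[(String, Int)]): Unit = {
  val totals = totalsPerWord(pairs).cache()
  println(totals.count())          // first action: computes and caches
  totals.take(5).foreach(println)  // second action: served from the cache
}
```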
Skills you'll gain
This course includes:
350 minutes of pre-recorded video
7 programming assignments
Access on Mobile, Tablet, Desktop
Full-time access
Shareable certificate
Get a Completion Certificate
Share your certificate with prospective employers and your professional network on LinkedIn.
Created by
Provided by
Top companies offer this course to their employees
Top companies provide this course to enhance their employees' skills, ensuring they excel in handling complex projects and drive organizational success.
There are 4 modules in this course
This comprehensive course focuses on big data analysis using Scala and Apache Spark, emphasizing distributed data processing. Students learn to manipulate large datasets using functional programming concepts and Spark's distributed collections framework. The curriculum covers essential topics like RDDs, transformations, actions, and optimization techniques. Special attention is given to performance considerations in distributed systems, including data locality and shuffle operations. The course also explores structured data processing using Spark SQL, DataFrames, and Datasets.
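As one illustration of the data-locality and shuffle considerations mentioned above, a pair RDD can be pre-partitioned so that all values for a key live on the same node. The partitioner choice and partition count below are assumptions for the sketch, not recommendations from the course.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Hash-partition the pair RDD and keep that layout in memory. Subsequent
// key-based operations (reduceByKey, joins using the same partitioner) can
// then run without a full shuffle.
def prePartition(events: RDD[(String, Int)]): RDD[(String, Int)] =
  events.partitionBy(new HashPartitioner(8)).persist()
```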
Getting Started + Spark Basics
Module 1 · 11 Hours to complete
Reduction Operations & Distributed Key-Value Pairs
Module 2 · 6 Hours to complete
Partitioning and Shuffling
Module 3 · 1 Hour to complete
Structured data: SQL, DataFrames, and Datasets (see the sketch below)
Module 4 · 8 Hours to complete
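For a taste of Module 4's structured APIs, here is a small sketch moving from a DataFrame to a typed Dataset. The CSV path, column names, and Listing case class are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type for the typed Dataset API.
case class Listing(city: String, price: Double)

object StructuredSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StructuredSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // DataFrame: untyped rows, optimized by the Catalyst query planner.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/listings.csv")
    df.groupBy("city").avg("price").show()

    // Dataset: the same optimizations plus compile-time types.
    val ds = df.as[Listing]
    ds.filter(_.price > 100.0).show()

    spark.stop()
  }
}
```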
Fee Structure
Payment options
Financial Aid
Instructor
Assistant Professor
Heather Miller is an assistant professor in the School of Computer Science at Carnegie Mellon University, where she focuses on data-centric distributed systems and programming languages. Before her current role, she was a research scientist at the École Polytechnique Fédérale de Lausanne (EPFL) and co-founded the Scala Center, which promotes the use of the Scala programming language. Miller has a PhD from EPFL, where she contributed significantly to Scala's development, and she is known for her work on MOOCs that have engaged over a million students. Her research aims to bridge theoretical advancements in programming languages with practical industrial applications.
Testimonials
Testimonials and success stories are a testament to the quality of this program and its impact on your career and learning journey. Be the first to help others make an informed decision by sharing your review of the course.
Frequently asked questions
Below are some of the most commonly asked questions about this course. We aim to provide clear and concise answers to help you better understand the course content, structure, and any other relevant information. If you have any additional questions or if your question is not listed here, please don't hesitate to reach out to our support team for further assistance.