Scalable Machine Learning on Big Data using Apache Spark

Master machine learning techniques for big data using Apache Spark, from data processing to implementing advanced ML algorithms.

This course teaches scalable machine learning techniques for big data using Apache Spark. Students will learn to leverage cluster computing and distributed storage to process extremely large datasets efficiently. The curriculum covers Apache Spark fundamentals, including RDD and DataFrame APIs, and progresses to implementing machine learning algorithms using SparkML. Learners will gain hands-on experience with statistical calculations, dimensionality reduction, clustering, and supervised learning models on big data. The course emphasizes practical skills in building and optimizing machine learning pipelines for large-scale data processing and analysis.

3.8

(1,248 ratings)

23,083 already enrolled

Instructors:

Romeo Kienzler

English

Pashto, Bengali, Urdu, 2 more

This course includes

6 hours of self-paced video lessons

Intermediate Level

Completion certificate awarded on course completion

2,699

What you'll learn

Understand Apache Spark's architecture and internal workings for big data processing

Implement parallel data processing strategies using RDD and DataFrame APIs

Apply statistical calculations and dimensionality reduction techniques on large datasets

Develop and optimize machine learning pipelines using SparkML

Implement clustering algorithms like K-means on big data

Build and evaluate supervised learning models such as linear and logistic regression

Skills you'll gain

Apache Spark

big data

machine learning

SparkML

data processing

cluster computing

distributed storage

RDD

DataFrame

dimensionality reduction

This course includes:

2.45 hours of prerecorded video

11 assignments

Access on Mobile, Tablet, Desktop

Full-time access

Shareable certificate

Closed captions

Get a Completion Certificate

Share your certificate with prospective employers and your professional network on LinkedIn.

Created by

IBM

Provided by

Coursera

There are 4 modules in this course

This course provides a comprehensive introduction to scalable machine learning using Apache Spark for big data applications. Students will learn the fundamentals of Apache Spark, including its internal workings and APIs like RDD and DataFrame. The curriculum covers parallel data processing strategies, functional programming basics, and the use of SparkSQL. Learners will gain hands-on experience in applying statistical calculations, dimensionality reduction techniques like PCA, and machine learning algorithms such as clustering and regression on large datasets. The course emphasizes the use of SparkML pipelines for efficient data processing and model building. By the end of the course, students will be able to implement both supervised and unsupervised learning tasks on big data, and understand how to optimize machine learning workflows for scalability.

Week 1: Introduction

Module 1 · 2 Hours to complete

Week 2: Scaling Math for Statistics on Apache Spark

Module 2 · 1 Hour to complete

Week 3: Introduction to Apache SparkML

Module 3 · 1 Hour to complete

Week 4: Supervised and Unsupervised learning with SparkML

Module 4 · 1 Hour to complete

Instructor

Romeo Kienzler

3.7 rating

188 Reviews

703,752 Students

10 Courses

Chief Data Scientist at IBM Specializing in Data Science and Parallel Processing Architectures

Romeo Kienzler is the Chief Data Scientist and Course Lead at IBM, where he draws on nearly two decades of experience in software engineering, database administration, and information integration. He holds a Master of Science from the Swiss Federal Institute of Technology (ETH) in Information Systems, Bioinformatics, and Applied Statistics. Since joining IBM in 2012, Romeo has focused his research on massively parallel data processing architectures and has published numerous works in the field through international publishers and conferences. He is also actively involved in various open-source projects. On Coursera, he teaches several courses, including Deep Learning with Keras and TensorFlow, Introduction to Big Data with Spark and Hadoop, Scalable Machine Learning on Big Data using Apache Spark, and Tools for Data Science, all designed to equip learners with essential skills in data science and machine learning.
