Data I/O and Preprocessing with Python and SQL

Most real-world data isn’t clean: it’s messy, incomplete, and spread across sources like websites, APIs, and databases. In this course, you’ll learn how to collect that data, clean it, and prepare it for analysis using Python and SQL. You’ll start by extracting data from webpages using tools like Pandas and Beautiful Soup, while also learning how to handle unstructured text and apply ethical scraping practices. Next, you’ll access real-time data through APIs, parse JSON files, and clean numerical data using techniques like normalization and binning. You’ll also learn how to manage authentication with API keys and store them securely. Finally, you’ll work with databases: querying and joining tables using SQL, validating results, and understanding when to use SQL versus Python for different preprocessing tasks. By the end of the course, you’ll be able to turn raw, real-world data into reliable, analysis-ready inputs, a core skill for any data professional.

Status: Data Preprocessing

Status: Relational Databases

Beginner · Course · 25 hours

All reviews

Mark Nemeth

5.0

Reviewed Jun 28, 2025

Very broad and thorough course on data collection techniques, preprocessing, analysis, and visualization. Highly recommend.

Nikhil Ranjan

5.0

Reviewed Jun 21, 2025

Very precise. Touches all relevant concepts with perfect examples. Good datasets and great evaluation.

Chisom Chioke

5.0

Reviewed Oct 23, 2025

Sean Barnes is a great teacher and his courses are terrific. How I wish his courses were available when I first decided to learn data science!

FNU Meghashree

5.0

Reviewed Jan 10, 2026

It was a good course.

Rafael Arias

3.0

Reviewed Feb 25, 2026

Module 3 places a disproportionate emphasis on statistics, which significantly disrupts the flow of the class. This dense focus makes the material feel tedious and often overwhelming, requiring significant mental stamina to navigate endless calculations without losing interest in the broader subject matter. Furthermore, given that Module 2 was also heavily focused on statistics, the cumulative intensity across these consecutive modules feels excessive. This sustained workload risks student burnout and detracts from a balanced learning experience.

Lord Rimaru

1.0

Reviewed Jun 27, 2026

In the Module 1 graded assignment there is a bug that stops me from getting certified.

Exercise: Exercise 7 - Extracting Information from HTML

Issue: The autograder returns "Object required for grading not found" for Exercise 7, even though the cell runs without errors and produces correct output.

What I've verified:

- Code runs cleanly with no errors, after a full kernel restart and Run All.
- jobs_df is created correctly:
  - Shape: (264, 5)
  - Columns: ['job_title', 'company', 'location', 'list_date', 'day_of_week']
  - First rows match expected output format (verified against the assignment's sample output).
- benefits_list is created correctly: length 264, matching the number of job listings.
- job_listings (from soup.find_all(...)) returns 264 results, confirming the scrape itself works.
- The saved .ipynb file on disk was inspected directly (via json.load) to confirm:
  - The graded cell is the only cell of its kind (no duplicate/stale Exercise 7 cells).
  - Cell metadata is {'deletable': False, 'tags': ['graded']}, identical to every other graded cell (Ex 1, 4, 5, 6a, 8) that grades successfully.
- Upstream dependencies pass fully: Exercise 6a and 6b ("requests") both show "All tests passed!"
- Tried a hard refresh + full resubmit (closed and reopened the lab, restarted kernel, ran all, saved, resubmitted): same failure persists.

My current Exercise 7 cell code:

    # GRADED CELL: Exercise 7

    ### START CODE HERE ###
    job_listings = soup.find_all("div", class_="base-search-card_info")

    jobs = []
    benefits_list = []  # keep benefits separate, used later in Exercise 9

    for job in job_listings:
        title_el = job.find("h3", class_="base-search-card_title")
        job_title = title_el.get_text(strip=True) if title_el else ""

        company_el = job.find("h4", class_="base-search-card_subtitle")
        company = company_el.get_text(strip=True) if company_el else ""

        location_el = job.find("span", class_="job-search-card_location")
        location = location_el.get_text(strip=True) if location_el else ""

        benefits_el = job.find("span", class_="job-posting-benefits_text")
        benefits = benefits_el.get_text(strip=True) if benefits_el else "None"
        benefits_list.append(benefits)

        list_date_el = job.find("p", class_="list-date")
        list_date = list_date_el.get_text(strip=True) if list_date_el else ""

        jobs.append([job_title, company, location, list_date])

    jobs_df = pd.DataFrame(jobs, columns=["job_title", "company", "location", "list_date"])
    ### END CODE HERE ###

Request: Could someone confirm whether this is a known autograder issue for this exercise, or advise what specific variable/format the grader expects that isn't covered in the visible instructions?
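The notebook check described in the review above (loading the saved .ipynb with json.load and looking for duplicate or mis-tagged graded cells) might look roughly like the sketch below. It is an illustration only, not the reviewer's actual script: the helper name find_graded_cells and the small in-memory notebook are hypothetical, and it assumes the standard notebook JSON layout, where each cell has cell_type, metadata, and source keys.

```python
import json


def find_graded_cells(notebook, marker):
    """Return (index, metadata) for every code cell whose source contains marker."""
    matches = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # Notebook JSON stores source either as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if marker in source:
            matches.append((index, cell.get("metadata", {})))
    return matches


# Hypothetical stand-in for the saved notebook; with a real file this would be
# something like: notebook = json.load(open("assignment.ipynb", encoding="utf-8"))
sample_notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 7"]},
        {
            "cell_type": "code",
            "metadata": {"deletable": False, "tags": ["graded"]},
            "source": ["# GRADED CELL: Exercise 7\n", "jobs = []\n"],
        },
        {
            "cell_type": "code",
            "metadata": {"deletable": False, "tags": ["graded"]},
            "source": ["# GRADED CELL: Exercise 8\n"],
        },
    ]
}

# Round-trip through JSON to mimic reading the file from disk.
notebook = json.loads(json.dumps(sample_notebook))

graded = find_graded_cells(notebook, "GRADED CELL: Exercise 7")
print(f"matching cells: {len(graded)}")
for index, metadata in graded:
    print(index, metadata.get("tags"))
```

More than one match would point to a duplicate or stale cell, and a match whose metadata lacks the graded tag would point to a tagging problem. A single, correctly tagged match, as the review reports, suggests the cause lies elsewhere.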