Welcome back. So far you've learned a lot about dirty data and how to clean up the most common errors in a dataset. Now we're going to take that a step further and talk about cleaning up multiple datasets. Cleaning data that comes from two or more sources is very common for data analysts, but it does come with some interesting challenges.

A good example is a merger, which is an agreement that unites two organizations into a single new one. In the logistics field, there have been lots of big changes recently, mostly because of the e-commerce boom. With so many people shopping online, it makes sense that the companies responsible for delivering those products to their homes are in the middle of a big shake-up. When big things happen in an industry, it's common for two organizations to team up and become stronger through a merger. Let's talk about how that would affect our logistics association.

As a quick reminder, this spreadsheet lists association member ID numbers, first and last names, addresses, how much each member pays in dues, when the membership expires, and the membership types. Now, let's think about what would happen if the International Logistics Association decided to get together with the Global Logistics Association in order to help their members handle the incredible demands of e-commerce. First, all the data from each organization would need to be combined using data merging. Data merging is the process of combining two or more datasets into a single dataset. This presents a unique challenge, because when two totally different datasets are combined, the information is almost guaranteed to be inconsistent and misaligned.

For example, the Global Logistics Association's spreadsheet has a separate column for a person's suite, apartment, or unit number, but the International Logistics Association combines that information with the street address. This needs to be corrected to make the number of address columns consistent. Next, check out how the Global Logistics Association uses people's email addresses as their member IDs, while the International Logistics Association uses numbers. This is a big problem because people in a certain industry, such as logistics, typically join multiple professional associations. There's a very good chance that these datasets include membership information on the exact same person, just recorded in different ways. It's super important to remove those duplicates. Also, the Global Logistics Association has many more member types than the other organization. On top of that, it uses the term "Young Professional" instead of "Student Associate," but both describe members who are still in school or just starting their careers. If you were merging these two datasets, you'd need to work with your team to fix the fact that the two associations describe memberships very differently.
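To make that more concrete, here's a minimal sketch of how those fixes might look as a single SQL query. The table and column names (global_members, intl_members, street_address, unit, member_type, and so on) are hypothetical stand-ins for whatever the real schemas contain, and the exact functions and UNION syntax vary a little by SQL dialect.

```sql
-- Hypothetical table and column names; adjust to match the real schemas.
-- Standardize each source first, then stack the rows with UNION.

SELECT
  email AS member_id,                               -- Global uses email addresses as member IDs
  first_name,
  last_name,
  CONCAT(street_address, ' ', unit) AS street_address,  -- fold the separate unit column into one address field
  dues,
  expiration_date,
  CASE
    WHEN member_type = 'Young Professional' THEN 'Student Associate'  -- align the member-type labels
    ELSE member_type
  END AS member_type
FROM global_members

UNION DISTINCT                                      -- BigQuery syntax; plain UNION in most other dialects.
                                                    -- Drops rows that are exact duplicates after standardizing.
SELECT
  CAST(member_id AS STRING) AS member_id,           -- International uses numbers, so cast to match the email IDs
  first_name,
  last_name,
  street_address,                                   -- the unit is already part of this column
  dues,
  expiration_date,
  member_type
FROM intl_members;
```

One caution: UNION only removes rows that are identical after standardizing. A member who appears under an email address in one dataset and a numeric ID in the other would still show up twice, so finding those duplicates takes a separate matching step, for example on name and address.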
Now you understand why the merging of organizations also requires the merging of data, and that can be tricky. But there are lots of other reasons why data analysts merge datasets. For example, in one of my past jobs, I merged a lot of data from multiple sources to get insights about our customers' purchases. The kinds of insights I gained helped me identify customer buying patterns.

When merging datasets, I always begin by asking myself some key questions to help me avoid redundancy and to confirm that the datasets are compatible. In data analytics, compatibility describes how well two or more datasets are able to work together. The first question I would ask is, do I have all the data I need? To gather customer purchase insights, I wanted to make sure I had data on customers, their purchases, and where they shopped. Next, I would ask, does the data I need exist within these datasets? As you learned earlier in this program, this involves considering the entire dataset analytically. Looking through the data before I start using it lets me get a feel for what it's all about, what the schema looks like, whether it's relevant to my customer purchase insights, and whether it's clean data. That brings me to the next question: do the datasets need to be cleaned, or are they ready for me to use? Because I'm working with more than one source, I will also ask myself, are the datasets cleaned to the same standard? For example, what fields are regularly repeated? How are missing values handled? How recently was the data updated? Finding the answers to these questions and understanding whether I need to fix any problems at the start of a project is a very important step in data merging.

In both of the examples we explored here, data analysts could use either spreadsheet tools or SQL queries to clean up, merge, and prepare the datasets for analysis. Depending on the tool you decide to use, the cleanup process can be simple or very complex. Soon, you'll learn how to make the best choice for your situation. As a final note, programming languages like R are also very useful for cleaning data. You'll learn more about how to use R and other concepts we covered soon.
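If you'd like a sense of what those pre-merge compatibility checks can look like in practice, here's a minimal sketch in SQL. Again, the table and column names (global_members, intl_members, email, last_updated, and so on) are hypothetical; swap in whatever the real datasets contain.

```sql
-- Hypothetical names; each query answers one pre-merge question.

-- How are missing values handled? Count blank or NULL emails in one source.
SELECT COUNT(*) AS missing_emails
FROM global_members
WHERE email IS NULL OR email = '';

-- What fields are regularly repeated? Find names listed more than once.
SELECT first_name, last_name, COUNT(*) AS times_listed
FROM intl_members
GROUP BY first_name, last_name
HAVING COUNT(*) > 1;

-- How recently was the data updated? Check the newest timestamp in a source.
SELECT MAX(last_updated) AS most_recent_update
FROM global_members;
```

Running the same checks against both sources is a quick way to see whether the datasets were cleaned to the same standard before you commit to merging them.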