Hello and welcome. In this video, we'll give you a few hints and tips for working with Real World Data-sets. Many of the real world data sets are made available as.CSV files. These are text files which contain data values typically separated by commas. In some cases a different separator such as a semicolon may be used. For this video, we will use an example of a file called DOGS.CSV. Although this is a fictional data set that contains names of dogs and their breeds, we will use it to illustrate concepts that you will then apply to real datasets. Sample contents of the DOGS.CSV file are shown here. The first row in the table in many cases contains attribute labels which map to column names in a table. In DOGS.CSV, the first row contains the name of three attributes. Id is the name of the first attribute and the subsequent rows contain Id values of 1, 2, and 3. The name of the dog is the second attribute. In this case the dog names Wolfie, Fluffy, and Huggy are the values. The third attribute is called breed, either the dominant breed or pure breed name. It has values of German Shepherd, Pomeranian, and Labrador. As we've just seen, CSV files can have the first or a header row that contains the names of the attributes. If you're loading the data into the database using the visual load tool in the database console, ensure the header in first row is enabled. This will map the attribute names in the first row of the CSV file into column names in the database table, and the rest of the rows into the data rows in the table, as shown here. Note that the default column names may not always be database or query friendly, and if that is the case, you may want to edit them before the table is created. Now, let's talk about querying column names that are lower or mixed case, that is a combination of upper and lowercase. Let's assume we loaded the DOGS.CSV file using the default column names from the CSV. If we try to retrieve the contents of the Id column using the query, select id from DOGS, we'll get an error as shown indicating the id is not valid. This is because the database parser assumes uppercase names by default. Whereas when we loaded the CSV file into the database it had the Id column name in mixed case i.e an uppercase I and a lowercase d. In this case, to select data from a column with a mixed case name, we need to specify the column name in its correct case within double quotes as follows. Select * "Id" from DOGS. Ensure you use double quotes around the column name and not single quotes. Next, we'll cover querying column names that have spaces and other characters. In a CSV file if the name of the column contain spaces, by default the database may map them to underscores. For example, in the name of dog column, there are spaces in between the three words. The database may change it to Name_of_Dog. Other special characters like parentheses or brackets may also get mapped to underscores. Therefore, when you write a query ensure you use proper case formatting within quotes and substitute special characters to underscores as shown in this example. Select "Id," "Name_of_Dog," "Breed__dominant_breed_if_not_pure_breed_"from dogs. Please note the underscores separating the words within double quotes. Also note the double underscore between breed and dominant as shown. Finally, it's also important to note the trailing underscore after the word breed near the end of the query. This is used in place of the closing bracket. When using quotes in Jupyter notebooks, you may be issuing queries in a notebook by first assigning them to Python variables. In such cases if your query contains double quotes for example, to specify a mixed case column name, you could differentiate the quotes by using single quotes for the Python variable to enclose this SQL query and double quotes for the column names. For example, selectQuery ='select "Id" from dogs.' Now, what if you need to specify single quotes within the query, for example, to specify a value in the where clause? In this case you can use backslash as the escape character as follows, select Query = 'select * from dogs where "Name_of_Dog"=\'Huggy\' '. If you have very long queries such as join queries or nested queries, it may be useful to split the query into multiple lines for improved readability. In Python notebooks, you can use the backslash character to indicate continuation to the next row as shown in this example. %sql select "Id," Name_of_Dog," \ from dogs \ where"Name_of_Dog" = 'Huggy.' It would be helpful at this point to take a moment to review the special characters as shown. Please keep in mind that you might get an error if you split the query into multiple lines in a Python notebook without the backslash. When using SQL magic, you can use the double percent SQL in the first line of the cell in Jupyter Notebooks. It implies that the rest of the content of the cell is to be interpreted by SQL magic. For example %% sql new row select "Id", "Name_of_dog," new row, from dogs, new row, where "Name_of_dog = 'Huggy.' Again, please note the special characters as shown. When using %% sql the backslash is not needed at the end of each line. At this point you might be asking, how would you restrict the number of rows retrieved? It's a good question because a table may contain thousands or even millions of rows, and you may only want to see some sample data or look at just a few rows to see what kind of data the table contains. You may be tempted to just do select * from table name to retrieve the results in a Pandas data frame and do a head function on it. But, doing so may take a long time for a query to run. Instead, you can restrict the results set by using the limit clause. For example, use the following query to retrieve just the first three rows in a table called census data. Select * from census_data limit 3. In this video we looked at some considerations and tips for working with real-world datasets. Thanks for watching.