Jason Fang - Projects

SQL is commonly used by data analysts and data scientists. I had used different types of SQL, such as SAS proc SQL, PopSql, MySQL Workbench, MS SQL. In the end, I decided to use Amazon Redshift. First, I don't need to install any SQL software on my laptop or computer. It can run a large database on the AWS server rather than on my antiquated laptop. Second, I'm very familiar with AWS services, and I use AWS S3 a lot and upload almost all of my work to S3. Third, Redshift is very powerful and user-friendly. I could generate charts directly from my query in Redshift. Also, I can see my queries' history and share them with my colleagues.

Amazon Redshift

This is an example of how I use Redshift. It's very intuitive. I just loaded the data from my AWS S3 bucket by using IAM role method. Redshift query editor v2 was introduced in September 2021. It's very convenient and powerful.

Database Design

Designing databases and building entity-relationship diagrams are essential skills for data scientists. You have to know the primary key, foreign keys, candidate keys, prime and non-prime attributes in your database so that you can better analyze the data.

Here are several reasons why data scientists need to know database design:
1. Understand why a database was designed in a certain way.
2. Learn quicker how to navigate your way around.
3. Find some potential vulnerabilities, flaws, and mistakes.
4. Communicate effectively with data engineers, data architects, and so on.

Normalization

Data scientists need to use normalization to organize the columns (attributes) and tables (relations) of a relational database and reduce data redundancy and improve data integrity. To reduce potential errors and increase data efficiency, I always create a database by using third normal form (3NF). There's no need to have duplicates in your database, and it will take more space and increase data analyzing time.

Types of normalization:
1. First Normal Form (1NF): Doesn't contain duplicate rows; and every cell contains only one value.
2. Second Normal Form (2NF): If it's in first normal form (1NF); and every non-prime attribute of the table is dependent on the whole of every candidate key.
3. Third Normal Form (3NF): If it's in second normal form (2NF); and every non-prime attribute is non-transitively dependent on every candidate key.

Normalization Case

This is an example of how I examine a database and then normalize it. This online shop transaction database is in first normal form (1NF), then I changed it to second normal form (2NF) and third normal form (3NF). Here are several steps that I did to normalize it.

Steps:
1. Check if it's in first normal form (1NF).
I checked if there are duplicated rows in the database by running the following query so that I can see the total numbers of rows are the total numbers of distinct rows. They matched, so it's in first normal form (1NF).

2. Check if it's in second normal form (2NF).
I need to find the candidate key, prime and non-prime attributes in this database. The transaction id is the most obvious primary key and candidate key, but there is more than one candidate key. To find the candidate keys in the database, I first think of the purpose of this database. Because it's a database to record every transaction, so purchase time and customer id make up the other candidate key. The first three columns are prime attributes and the rest columns are non-prime attributes. Then I found that customer id is sufficient to identify the first name, second name, shipping state, and loyalty discount. It's not in second normal form, so I dropped these columns and create a separate table for these attributes. I selected the unique customer rows and created another table. Then I dropped those columns in the original table (Couldn't drop multiple columns at once in Redshift). Now, this database is in second normal form (2NF).

3. Check if it's in third normal form (3NF).
Because I created a new table in the last step, I had to check two tables in this step. The customer table is in third normal form (3NF), but the modified transaction table is not in third normal form (3NF). The description and retail price columns are dependent on the item column. So split this table into two tables. The process is the same as the last one. I created another table for the item, description, and retail price, then I dropped the description and retail price from the original table.

4. Results
Now there are three tables, but I removed many unnecessary duplicated rows. It's not obvious in this example, but if a database contains billions of records, a database in third normal form (3NF) is much more efficient.

Portfolio