Data Normalization in SQL
Why Normalization
Normalization of data is done to reduce the redundant/duplicate data and making the database free from anomalies like insertion, deletion, and updation anomalies.
It makes the queries simple and faster.
It saves space.
It organizes the database.
It eliminated the duplicate / repeated data.
NOTE: The data that we have must make some sense.
Now, what is an anomaly?
Ex; suppose there is some data at multiple places, we updated the data at one place but forget to update the data at other cells. Now it will cause trouble for us as we cant conclude which data is correct. As the data is something at one place and something else at another place.
There are 3 anomalies: Insertion, deletion, and updation.
Let make it by example.
Suppose for a new position in a company, Mr. Chintu is selected but his department is not allotted to him in that case if we want to update his info into the database, we have to set the department value is null. This is the case with only one employee.
Similarly, if we have to set this dept. data for thousands of employees who are in the same condition, then this dept data will be repeated for all those thousands of employees. This is an ex of an insertion anomaly.
2. What if Mr chintu leaves the company and no longer he is the head of marketing. In that case, all the employee record has to be updated and if we forget to update any record, this will lead to the data inconsistency. This is updation anomaly.
3. Deletion anomaly- In our employee table, there are 2 columns: Employee and department. At the end of the financial year, if employee records are deleted, the department info is also lost.
Types of normal form.
1NF - In the first normal form, we deal with the problem of atomicity. Atomicity means that the values of the cell are in their most reduced form. Just like the atoms, they cannot be further divided.
In layman, a single cell must contain a single value to follow the atomicity rule.
NOTE: If a table contains composite or multivalued attributes, it violates the first normal form.
Suppose in a table, we have employee id, name, phone no, salary as columns.
Since one employee can have more than one phone no, so phone no is a multivalued attributed. So we need to make separate rows to solve this problem. This way we can say that the table has achieved its atomicity.
2NF- the table is said to be in 2nd normal form if it follows:
*1nf
*The table should not contain partial dependency.
Now, what is a partial dependency?
The proper subset of a candidate key determines a non-prime attribute.
Now, what's a non-prime attribute?
Attributes that form a candidate key in a table are prime attributes and the rest are nonprime.
For the table, that we have columns like employee id and department id and the nonprime attribute is the office location.
EX: in the table, employee id and department id make the primary key. the office location column depends on the department id. So we have to split the table accordingly.
the first table has employment id and department id and the 2nd one has dep id and office location.
The column office location is fully dependent on the primary key of that table which is nothing but the dep id.
3.)3NF- the table is said to be in 3rd normal form if it
* follows 2nf
* no non-prime attribute is transitively dependent on any non-prime attribute which depends on any non-prime attribute.
let us make it simple, if c depends on b and b depends on a then, transitively c depends on a. So there should be no transitive dependency of nonprime attributes on another non-prime attribute. The nonprime attribute should be dependent on the prime attribute.
Ex: in a table, student id determines subject id and subj id determines subjects. so student id determines subj via subj id. This shows that there is a transitive functional dependency in the table and it violates the 3nf.
What to do?
Divide the table, one table has student id, student name, subj id, and address and another table has subj id and subj. All the nonprime attributes are fully functionally dependent on the primary key.
In the first table columns like student name, subj id and address are completely dependent on student id. Similarly in the 2nd table subj only depends on subj id.
4. BCNF- Boyce Codd normal form. also called 3.5 nf. It is the higher version of 3nf. Before BCNF, you must know what is a super key?
Super key is a set of one or more attributes(columns) that can uniquely identify a row in a table. A candidate key is the subset of a superkey.
a table is said to be in BCNF if it follows
* 3nf
*every functional dependency A-> B, then A has to be the super key of that particular table.
Ex: in a table, one student can enroll for multiple subjects, there can be multiple teachers for one subject, and for each subject, a professor is assigned to the student. all normal forms are followed except.
This is because the student id and subject form the primary key and the subject columns is a prime attribute. But there is one more dependency that the professors depend on subjects. while the subject is the prime attribute and the professor is the non-prime attributes, which is not the rule for BCNF,.
So we will divide it into 2 parts. one table will contain the student and the professor's id. Here we created a new column named professor id. And the other table includes professor id, subj, and professor.
NOTE: by creating a new column "professor id ", we have removed the non-prime attributes functional dependency. In the 2nd table professor id is the super key for that table and the remaining columns of it are functionally dependent on it.