[CMU Database Systems] 02. Modern SQL

17 minute read

Relational Languages

Relational Languages are a type of query language used to manage and interact with relational databases. The concept of relational databases and the relational model was introduced by Edgar Codd in the early 1970s. In his paper, he defined the mathematical notation for how a DBMS (database management system) could execute queries on a relational model database.

For the users, the process of querying a relational database is simplified as they only need to specify the desired result using a declarative language such as SQL (Structured Query Language). The DBMS is responsible for determining the most efficient plan to produce the result. High-end relational databases have a sophisticated query optimizer that can rewrite queries and search for the optimal execution strategy.

Relational algebra is based on the concept of sets, which are unordered collections of unique elements. On the other hand, SQL is based on the concept of bags, which are unordered collections of elements that allow duplicates.

SQL History

SQL (Structured Query Language) is a declarative query language for relational databases that was created in the 1970s by IBM as part of its original DBMS called System R. It was first called “SEQUEL” (Structured English Query Language) but was later changed to SQL. In 1986, ANSI and ISO standard groups officially adopted SQL as the standard database language.

SQL is an evolving language and new features have been added with each new edition of the SQL standard. Below are significant updates released with each new edition of the SQL standard.

SQL:1999 Regular expressions, Triggers, OO
SQL:2003 XML, Windows, Sequences, Auto-Gen IDs
SQL:2008 Truncation, Fancy sorting
SQL:2011 Temporal DBs, Pipelined DML
SQL:2016 JSON, Polymorphic tables

The current standard is SQL:2016, while SQL-92 is the minimum that a DBMS must support to claim they support SQL. However, each vendor may have proprietary extensions beyond the standard.

SQL consists of different classes of commands:

Data Manipulation Language (DML): commands to manipulate data, such as SELECT, INSERT, UPDATE, and DELETE.
Data Definition Language (DDL): commands to define the schema for tables, indexes, views, and other objects.
Data Control Language (DCL): commands for security and access control.

All examples for SQL in this post will use the following database that models a simple college.

CREATE TABLE student (
  sid	INT PRIMARY KEY,
  name	VARCHAR(16),
  login VARCHAR(32) UNIQUE,
  age	SMALLINT,
  gpa	FLOAT
);

CREATE TABLE course (
  cid	VARCHAR(32) PRIMARY KEY,
  name	VARCHAR(32)	NOT NULL
);

CREATE TABLE enrolled (
  sid	INT REFERENCES student (sid),
  cid	VARCHAR(32) REFERENCES course (cid),
  grade	CHAR(1)
);

student(sid, name, login, age, gpa)

sid	name	login	age	gpa
53666	Kanye	kanye@cs	44	4.0
53688	Bieber	jbieber@cs	27	3.9
53655	Tupac	shakur@cs	25	3.5

enrolled(sid,cid,grade)

sid	cid	grade
53666	15-445	C
53688	15-721	A
53688	15-826	B
53655	15-445	B
53666	15-721	C

course(cid,name)

cid	name
15-445	Database Systems
15-721	Advanced Database Systems
15-826	Data Mining
15-799	Special Topics in Databases

Joins

A join operation combines the rows from two or more tables into a single result set. The join is performed based on the values in a common column, known as a join condition, between the tables being joined. The resulting table consists of columns from both tables and only includes rows where the join condition is true.

Joins are essential for retrieving data that is stored in multiple tables and are a fundamental aspect of relational databases. There are several types of joins, each with its own use case:

Inner Join: Returns only the rows that have matching values in both tables being joined.
Left Join or Left Outer Join: Returns all the rows from the left table (table1) and the matching rows from the right table (table2). If there’s no match, the result will have NULL values for the columns from the right table.
Right or Right Outer Join: Returns all the rows from the right table (table2) and the matching rows from the left table (table1). If there’s no match, the result will have NULL values for the columns from the left table.
Full Join or Full Outer Join: Returns all rows from both tables, with NULL values in the columns where there’s no match.
Cross Join: Returns the Cartesian product of the two tables, meaning every row in table1 is combined with every row in table2.

The example below shows an inner join between the enrolled and student tables based on the join condition e.sid = s.sid, where e and s are aliases for the enrolled and student tables, respectively.

Example: Which students got an A in 15-721?

SELECT s.name
  FROM enrolled AS e, student AS s
 WHERE e.grade = 'A' AND e.cid = '15-721'
   AND e.sid = s.sid; 

Aggregates

Aggregate functions are a set of functions in SQL that takes in a set of tuples (or rows) from a table and produce a single scalar value as output. The aggregate functions that are most commonly used in SQL include:

AVG(COL) : The average value of COL.
MIN(COL) : The minimum value of COL.
MAX(COL) : The maximum value of COL.
SUM(COL) : The sum of the values in COL.
COUNT(COL) : The number of tuples in COL.

Aggregate functions can only be used in the SELECT output list, meaning they are used to produce a single value for each set of rows in the output.

Example: Get the number of students with a ‘@cs’ login

The following four queries are equivalent:

SELECT COUNT(login) AS cnt FROM student WHERE login LIKE '%@cs';

SELECT COUNT(*) AS cnt FROM student WHERE login LIKE '%@cs';

SELECT COUNT(1) AS cnt FROM student WHERE login LIKE '%@cs';

SELECT COUNT(1+1+1) AS cnt FROM student WHERE login LIKE '%@cs';

A SELECT statement can contain multiple aggregates.

Example: Get the number of students and their average GPA that have a ‘@cs’ login

SELECT AVG(gpa), COUNT(sid) FROM student WHERE login LIKE '%@cs';

AVG(gpa)	COUNT(sid)
3.8	3

It’s also possible to use the DISTINCT keyword with some aggregate functions (e.g., COUNT, SUM, AVG) to calculate the aggregate value based only on unique values.

Example: Get the number of unique students that have a ‘@cs’ login.

SELECT COUNT(DISTINCT login) FROM student WHERE login LIKE '%@cs';

COUNT(DISTINCT login)
3

The output of other columns outside an aggregate is undefined.

Example: Get the average GPA of students in each course.

-- The value of e.cid is undefined below
SELECT AVG(s.gpa), e.cid
  FROM enrolled AS e, student AS s
 WHERE e.sid = s.sid;

The GROUP BY clause is used to project tuples into subsets based on specified columns and calculate aggregates against each subset.

SELECT AVG(s.gpa), e.cid
  FROM enrolled AS e, student AS s
 WHERE e.sid = s.sid
 GROUP BY e.cid;

group by

When grouping rows using the GROUP BY clause, it’s important to keep in mind that non-aggregated values in the SELECT output clause must appear in the GROUP BY clause.

SELECT AVG(s.gpa), e.cid, s.name
  FROM enrolled AS e, student AS s
 WHERE e.sid = s.sid
 GROUP BY e.cid, s.name;

The HAVING clause can be used to filter the output results based on an aggregate calculation, similar to the way a WHERE clause filters rows based on non-aggregate values. The HAVING clause acts like a WHERE clause for a GROUP BY.

Example: Get the set of courses in which the average student GPA is greater than 3.9.

SELECT AVG(s.gpa) AS avg_gpa, e.cid
  FROM enrolled AS e, student AS s
 WHERE e.sid = s.sid
 GROUP BY e.cid
HAVING avg_gpa > 3.9;

having

It’s important to note that while many major database systems support the use of the HAVING clause, the above syntax may not be compliant with the SQL standard. To make a query standard-compliant, the use of the aggregate function must be repeated in the HAVING clause (e.g., use AVG(s.gpa) instead of avg_gpa).

SELECT AVG(s.gpa) AS avg_gpa, e.cid
  FROM enrolled AS e, student AS s
 WHERE e.sid = s.sid
 GROUP BY e.cid
HAVING AVG(s.gpa) > 3.9;

String Operations

In SQL, strings are a data type that represent sequences of characters. String operations are actions performed on string data, such as manipulating and comparing strings. The SQL standard defines how strings should be treated in SQL, including their case sensitivity and the type of quotes used to define them. However, DBMSs can vary in their implementation of string operations.

SQL-92, the original SQL standard, states that strings are case-sensitive and can only be defined using single quotes. Some DBMSs, such as PostgreSQL and Oracle, follow this standard. Other DBMSs, such as MySQL and SQLite, have different implementations. For example, MySQL allows strings to be defined using single or double quotes and it considers strings to be case-insensitive by default.

	String Case	String Quotes
SQL-92	Sensitive	Single Only
Postgres	Sensitive	Single Only
MySQL	Insensitive	Single/Double
SQLite	Sensitive	Single/Double
MSSQL	Sensitive	Single Only
Oracle	Sensitive	Single Only

-- SQL-92
WHERE UPPER(name) = UPPER('KaNyE')

-- MySQL
WHERE name = 'KaNyE'

SQL provides several functions to manipulate strings, including pattern matching, string functions, and string concatenation.

Pattern Matching is used to compare strings based on a pattern. The LIKE keyword is used to perform pattern matching.

% matches any substring (including an empty string).
_ matches any individual character.

SELECT * FROM enrolled AS e
 WHERE e.cid LIKE '15-%'

SELECT * FROM student AS s
 WHERE s.login LIKE '%@c_'

String Functions is used to perform operations on strings, such as finding substrings (e.g., SUBSTRING(S, B, E)), converting strings to upper or lower case (e.g., UPPER(S)), and more.

SELECT SUBSTRING(name, 1, 5) AS abbrv_name
  FROM student WHERE sid = 53688

SELECT * FROM student AS s
 WHERE UPPER(s.name) LIKE 'KAN%'

Concatenation is the process of combining two or more strings into a single string. In SQL, this is typically done using the || operator in SQL-92 and some other DBMSs, and using the + operator in MSSQL, and the CONCAT function in MySQL.

-- SQL-92
SELECT name FROM student
 WHERE login = LOWER(name) || '@cs'

-- MSSQL
SELECT name FROM student
 WHERE login = LOWER(name) + '@cs'

-- MySQL
SELECT name FROM student
 WHERE login = CONCAT(LOWER(name), '@cs'

Date and Time Operations

Date and time operations are used in SQL to manipulate and modify date and time attributes in a database. They are used both in output and predicates to perform different operations on date and time values. It is important to note that the syntax for date and time operations varies widely across different DBMSs and it’s essential to understand the syntax for the specific DBMS you are working with.

Example: Get the number of days since the beginning of the year

-- Postgres
SELECT DATE('2023-01-02') - DATE('2023-01-01') AS days;

-- MySQL
SELECT DATEDIFF(DATE('2023-01-02'), DATE('2023-01-01')) AS days;

-- SQLlite
SELECT julianday(CURRENT_TIMESTAMP) - julianday('2023-01-01');

-- MSSQL
SELECT DATEDIFF(DAY, DATE('2023-01-01'), DATE('2023-01-02'));

Output Redirection

Output redirection refers to the ability of a DBMS to store the results of a query in a table instead of returning the results to the client (e.g., terminal). This can be useful if the result set is large or if the result needs to be used in subsequent queries. There are two ways to redirect output in SQL:

New Table: To create a new table and store the results of a query in it, the user can use the INTO clause in the SELECT statement. The syntax of the INTO clause varies across different DBMSs. Some DBMSs also allow you to create a temporary table that will be automatically deleted when the session is closed.

-- SQL-92
SELECT DISTINCT cid INTO CourseIds
  FROM enrolled;

-- MySQL
CREATE TABLE CourseIds (
  SELECT DISTINCT cid FROM enrolled);

SELECT DISTINCT cid 
  INTO TEMPORARY CourseIds
  FROM enrolled;

Existing Table: To insert the results of a query into an existing table, the user can use the INSERT INTO statement. The target table must have the same number of columns and data types as the result set, but the column names do not have to match. Different DBMSs have different options and syntax for handling integrity violations, such as invalid duplicates.

-- SQL-92
INSERT INTO CourseIds
(SELECT DISTINCT cid FROM enrolled);

Output Control

Output control in SQL refers to the manipulation of the results returned by a query. By default, the output is unordered and the results are returned as they are produced by the DBMS. However, the ORDER BY clause allows you to sort the results by the values in one or more columns.

SELECT sid, grade FROM enrolled WHERE cid = '15-721'
 ORDER BY grade;

SELECT sid, grade FROM enrolled WHERE cid = '15-721'
 ORDER BY 2;

sid	grade
53123	A
53334	A
53650	B
53666	D

The sort order is ascending by default but can be reversed to descending using the DESC keyword.

SELECT sid, grade FROM enrolled WHERE cid = '15-721'
 ORDER BY grade DESC;

If two or more tuples have the same value in the sort column, you can use additional ORDER BY clauses to break the tie.

SELECT sid, grade FROM enrolled WHERE cid = '15-721'
 ORDER BY grade DESC, sid ASC;

SELECT sid, grade FROM enrolled WHERE cid = '15-721'
 ORDER BY grade DESC, 1 ASC;

sid	grade
53666	D
53650	B
53123	A
53334	A

You can also use any arbitrary expression in the ORDER BY clause to sort the results.

SELECT sid, grade FROM enrolled WHERE cid = '15-721'
 ORDER BY UPPER(grade) DESC, sid + 1 ASC;

The LIMIT clause allows you to control the number of tuples returned in the result. You can specify the maximum number of tuples to return, for example, LIMIT 10 returns the first 10 tuples.

SELECT sid, name FROM student WHERE login LIKE '%@cs'
 LIMIT 10;

-- MSSQL
SELECT TOP 10 sid, name FROM student
 WHERE login LIKE '%@cs'

You can also provide an offset to specify a range of tuples to return. For example, LIMIT 20 OFFSET 10 returns the 11th to 30th tuples.

SELECT sid, name FROM student WHERE login LIKE '%@cs'
 LIMIT 20 OFFSET 10;

However, if you don’t use the ORDER BY clause with a LIMIT, the DBMS may produce different results each time you run the query because the relational model does not impose an ordering.

Nested Queries

A nested query is a query within another query in a DBMS. The outer query acts as the primary query, and the inner query(s) are used to execute more complex logic. These inner queries can appear in different parts of the primary query like the SELECT output targets, the FROM clause, and the WHERE clause.

SELECT Output Targets:

 SELECT (SELECT 1) AS one FROM student;

FROM Clause:

 SELECT name
   FROM student AS s, (SELECT sid FROM enrolled) AS e
  WHERE s.sid = e.sid;

WHERE Clause:

 SELECT name FROM student
  WHERE sid IN (SELECT sid FROM enrolled);

The scope of the outer query is included in the inner query but not the other way around, meaning the inner query can access attributes from the outer query, but the outer query cannot access the attributes from the inner query.

Example: Get the names of students that are enrolled in ‘15-445’

SELECT name FROM student
 WHERE sid IN (
   SELECT sid FROM enrolled
    WHERE cid = '15-445'
);

SELECT name FROM student
 WHERE sid = ANY(
   SELECT sid FROM enrolled
    WHERE cid = '15-445'
)

Note that sid has different scope depending on where it appears in the query. The student table binds the first sid, and the enrolled table binds the second sid.

Nested Query Results Expressions

Nested queries can have different results expressions such as ALL, ANY, IN, and EXISTS:

ALL: The expression must be satisfied for all the rows in the sub-query.
ANY: The expression must be satisfied for at least one row in the sub-query.
IN: Equivalent to =ANY().
EXISTS: At least one row must be returned without comparing it to any attribute in the outer query.

For example, consider a scenario where you want to find the student record with the highest id who is enrolled in at least one course. To solve this problem, you can use a nested query. You can first write an inner query to get the maximum student id from the enrolled table, and then use that in the outer query to get the student record with the highest id.

SELECT sid, name FROM student
 WHERE sid IN (
   SELECT MAX(sid) FROM enrolled
)

SELECT sid, name FROM student
 WHERE sid IN (
   SELECT sid FROM enrolled
    ORDER BY sid DESC LIMIT 1
)

SELECT student.sid, name
  FROM student
  JOIN (SELECT MAX(sid) AS sid
          FROM enrolled) AS max_e
    ON student.sid = max_e.sid;

sid	name
53688	Bieber

Another example is to find all courses that have no students enrolled in it. To solve this problem, you can use a nested query with the EXISTS expression. You can write an inner query to select all records from the enrolled table, and then use the NOT EXISTS expression in the outer query to check if the course id does not exist in the enrolled table. If it does not exist, it means that there are no students enrolled in that course.

SELECT * FROM course
 WHERE NOT EXISTS(
   SELECT * FROM enrolled
    WHERE course.cid = enrolled.cid
);

cid	name
15-799	Special Topics in Databases

Window Functions

Window functions are a type of SQL function that perform a “sliding” calculation across a set of related tuples. Unlike traditional aggregation functions, which group tuples into a single output tuple, window functions keep the tuples separate and calculate the result for each individual tuple.

In SQL syntax, a window function is written as follows:

SELECT ... FUNC-NAME(...) OVER (...)
  FROM tableName

The FUNC-NAME is the name of the window function, which can be any aggregation function or a special window function. There are two special window functions, ROW_NUMBER and RANK.

ROW_NUMBER: Assign a unique number to each tuple in the set.
RANK: Assign a rank based on the order of the tuples after sorting.

SELECT *, ROW_NUMBER() OVER() AS row_num
  FROM enrolled

sid	cid	grade	row_num
53666	15-445	C	1
53688	15-721	A	2
53688	15-826	B	3
53655	15-445	B	4
53666	15-721	C	5

The OVER clause is used to specify the grouping and ordering of the tuples. The PARTITION BY clause is used to determine the grouping, and the ORDER BY clause is used to determine the ordering of the results.

SELECT cid, sid,
    ROW_NUMBER() OVER (PARTITION BY cid) AS row_number
  FROM enrolled
 ORDER BY cid

cid	sid	row_number
15-445	53666	1
15-445	53655	2
15-721	53688	1
15-721	53666	2
15-826	53688	1

SELECT *, ROW_NUMBER() OVER (ORDER BY cid)
  FROM enrolled;

It is important to note that ROW_NUMBER and RANK have different calculation sequences. ROW_NUMBER is computed before the sorting, whereas RANK is computed after the sorting. This can affect the results in certain cases, such as when there are ties in the data.

An example use case of a window function is finding the student with the second highest grade for each course. This can be achieved by using the RANK function, which assigns a rank to each tuple based on the value of the grade column. The query is written as follows:

SELECT * FROM (
  SELECT *, RANK() OVER (PARTITION BY cid
            ORDER BY grade ASC) AS rank
  FROM enrolled) AS ranking
WHERE ranking.rank = 2

The query first selects all tuples from the enrolled table and calculates the rank for each tuple using the RANK function, partitioning by the cid column and ordering by the grade column in ascending order. The resulting ranking is then filtered to return only the tuples with a rank of 2.

Common Table Expressions

Common Table Expressions (CTEs) are a way of defining a temporary result set within a SQL query. They are similar to views, but unlike views, they only exist for the duration of a single query. CTEs are useful in writing more complex SQL queries by breaking them down into smaller, more manageable pieces.

A CTE is defined using the WITH clause, followed by a SELECT statement that defines the contents of the CTE. The CTE name is given after the WITH keyword, followed by the column names in parentheses (if any). The CTE’s SELECT statement is followed by the AS keyword, and the result set of the SELECT statement is then bound to the CTE name.

For example, the following query defines a CTE named cteName with a single column named col1 that contains the value 1.

WITH cteName (col1) AS (
   SELECT 1
)
SELECT * FROM cteName

Multiple CTEs can be defined within the same query, separated by commas.

WITH cte (col1) AS (SELECT 1), cte2 (col2) AS (SELECT 2)
SELECT * FROM cte1, cte2;

Example: Find the student record with the highest id that is enrolled in at least one course.

WITH cteSource (counter) AS (
   SELECT MAX(sid) FROM enrolled
)
SELECT name FROM student, cteSource
 WHERE student.sid = cteSource.maxId;

CTEs can also be recursive, meaning that the result set of the CTE can refer to itself. To define a recursive CTE, the WITH keyword is followed by the keyword RECURSIVE, followed by the SELECT statement that defines the CTE.

For example, the following query defines a recursive CTE named cteSource that generates a sequence of numbers from 1 to 10:

WITH RECURSIVE cteSource (counter) AS (
   (SELECT 1)
   UNION ALL
   (SELECT counter + 1 FROM cteSource
     WHERE counter < 10)
)
SELECT * FROM cteSource

The UNION ALL clause is used to join the current result set with the result set of the CTE itself.

References

[1] CMU Intro to Database Systems / Fall 2022, 02 - Modern SQL

[2] Wikipedia - SQL

Share on

Twitter Facebook LinkedIn

Kunwoo Park