MySQL Technical Internals Reading Notes

  • Database: a collection of files on the file system
  • Instance: a process composed of background threads and a shared memory area
  • Architecture: single process, multi-threaded
  • Configuration reading: the last configuration file read takes precedence
  • Features: pluggable storage engines

Various Storage Engines

InnoDB

Row-level locking design; supports foreign keys; by default, read operations do not acquire locks. Uses MVCC (Multi-Version Concurrency Control) to achieve high concurrency.

MyISAM

Does not support transactions; table-level locking design; supports full-text indexing. Its buffer pool caches only index files, not data files.

Connecting to Database

Connecting is in essence inter-process communication between the client process and the database instance, so it supports TCP/IP, named pipes, UNIX domain sockets, etc.

When connecting via TCP/IP, the database first checks the privilege tables for authentication; the relevant table is named 'user'.

InnoDB Storage Engine

Architecture: multiple background threads + a memory pool + files

The main job of the background threads is to refresh data in the memory pool, ensuring that the buffer pool caches the most recent data, and to flush modified (dirty) data back to the disk files.

  • Master Thread: asynchronously flushes data to disk, ensuring data consistency
  • IO Thread: handles callbacks for asynchronous IO requests
  • Purge Thread: reclaims undo pages that are no longer needed after transactions commit
  • Page Cleaner Thread: performs dirty-page flushing in its own thread

The buffer pool is analogous to the kernel page cache. Pages are flushed back to disk via the checkpoint mechanism. Page types held in the pool: index pages, data pages, undo pages, the insert buffer, the adaptive hash index, lock information, and data dictionary information.

Multiple buffer pool instances can exist; each page is assigned to an instance based on a hash value, reducing internal resource contention. The default page size is 16KB, and pages are managed with an LRU list.

Full table scans may cause LRU list pollution
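InnoDB mitigates scan pollution with midpoint insertion: a newly read page enters the middle of the LRU list (the "old" sublist) and is promoted to the hot end only when touched again. A toy Python sketch of the idea (class and names hypothetical, not InnoDB's actual code):

```python
from collections import OrderedDict

class MidpointLRU:
    """Toy LRU with midpoint insertion: new pages land in the 'old'
    sublist; only a second access promotes them to the 'young' sublist,
    so a one-pass full table scan cannot evict hot pages."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.young = OrderedDict()   # hot pages, most recently used last
        self.old = OrderedDict()     # newly read, not-yet-hot pages

    def access(self, page_id):
        if page_id in self.young:
            self.young.move_to_end(page_id)      # refresh hot page
        elif page_id in self.old:
            del self.old[page_id]                # second touch: promote
            self.young[page_id] = True
        else:
            self.old[page_id] = True             # first read: old sublist
            if len(self.young) + len(self.old) > self.capacity:
                victim = self.old if self.old else self.young
                victim.popitem(last=False)       # evict from old first

pool = MidpointLRU(capacity=4)
pool.access("hot"); pool.access("hot")   # page promoted to young
for p in range(100):                     # full scan of cold pages
    pool.access(p)
assert "hot" in pool.young               # hot page survived the scan
```

Scan pages are only ever evicted from the old sublist, so the hot working set in the young sublist is untouched.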

Checkpoint

Checkpoints are used to:

  1. Shorten database recovery time
  2. Flush dirty pages back to disk when buffer pool space is insufficient
  3. Flush dirty pages when the redo log is about to become unavailable

When the database crashes, it does not need to replay all logs; it only needs to recover from the redo log entries written after the checkpoint.

When too many dirty pages make buffer pool space insufficient, a checkpoint is forced to flush some of them back to disk.

When the redo log needs to be reused, a checkpoint is triggered.

Two types:

  • Sharp Checkpoint: flushes all dirty pages back to disk when the database shuts down
  • Fuzzy Checkpoint: flushes only a portion of the dirty pages; the whole process is asynchronous, monitoring disk IO load and flushing when the load is low

Insert Buffer

Inserting into the clustered index (primary key) is generally sequential (auto-increment), requiring no random disk reads, so insertion is fast. Secondary indexes are non-clustered; due to the B+ tree structure, inserting into them requires random reads.

For non-clustered index insertion:

  1. Check whether the target secondary index page is in the buffer pool; if so, insert directly
  2. If not, store the write request temporarily, then at a certain frequency merge the buffered entries into the secondary index leaf pages, usually combining multiple insertions into one operation (within a single index page), greatly improving insertion performance

Similarly, delete and update operations have corresponding buffers.

Internally implemented as a B+ tree stored in the shared tablespace, which handles insert buffering for the secondary indexes of all tables.
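The buffer-then-merge idea can be sketched as follows (a toy model with hypothetical names, not InnoDB's actual B+ tree implementation):

```python
from collections import defaultdict

class InsertBufferSketch:
    """Toy insert buffer: writes to secondary-index pages not in the
    buffer pool are queued, then merged per page in one batch, turning
    many random IOs into one IO per index page."""
    def __init__(self, pages_in_pool):
        self.pages_in_pool = pages_in_pool   # page_no -> list of keys
        self.pending = defaultdict(list)     # page_no -> buffered keys
        self.page_ios = 0

    def insert(self, page_no, key):
        if page_no in self.pages_in_pool:
            self.pages_in_pool[page_no].append(key)  # page cached: direct
        else:
            self.pending[page_no].append(key)        # buffer the write

    def merge(self, read_page_from_disk):
        for page_no, keys in self.pending.items():
            page = read_page_from_disk(page_no)      # one random read
            self.page_ios += 1
            page.extend(keys)                        # apply all at once
            self.pages_in_pool[page_no] = page
        self.pending.clear()

disk = {7: [], 9: []}
buf = InsertBufferSketch(pages_in_pool={})
for key in [1, 2, 3]:
    buf.insert(7, key)       # three inserts destined for page 7
buf.insert(9, 4)
buf.merge(lambda p: disk[p])
assert buf.page_ios == 2     # 4 inserts cost only 2 page reads
```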

Double Write

Doublewrite writes a copy of each page before writing the page in place, preventing data loss from partial (torn) page writes. When a write failure occurs, the page is first restored from its copy, then redo is applied.
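A minimal sketch of the write path and recovery, assuming a dict-based "disk" (the real doublewrite area is a 2MB region in the shared tablespace, written in 16KB pages):

```python
def doublewrite(pages, dw_area, datafile, crash_at=None):
    """Write each page first to the doublewrite area, then to its final
    location. A crash mid-write leaves a torn page in the data file,
    but the doublewrite copy is intact."""
    for page_no, data in pages:
        dw_area[page_no] = data              # step 1: sequential copy
    for i, (page_no, data) in enumerate(pages):
        if crash_at == i:
            datafile[page_no] = data[: len(data) // 2]  # torn page
            return
        datafile[page_no] = data             # step 2: write in place

def recover(dw_area, datafile, page_size):
    """Before applying redo, restore any torn page from its copy."""
    for page_no, copy in dw_area.items():
        if len(datafile.get(page_no, b"")) != page_size:
            datafile[page_no] = copy

dw, df = {}, {}
doublewrite([(0, b"A" * 16), (1, b"B" * 16)], dw, df, crash_at=1)
recover(dw, df, page_size=16)
assert df[1] == b"B" * 16                    # torn page restored
```

Redo alone could not repair the torn page, because redo records describe changes to a page that is assumed to be internally consistent; the doublewrite copy supplies that consistent base.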

Asynchronous IO

The biggest advantage is enabling IO merging to improve IO performance. It is natively supported by the kernel. Dirty-page flushing, disk writes, and read-ahead are all performed through AIO.

Indexes

If iostat shows 100% disk utilization, it may indicate that too many indexes have been added.

B+ Tree Index

This type of index cannot locate a specific row for a given key value; it can only find the page containing the row, after which the database reads that page into memory and searches within it.

A B+ tree performs many page splits during insertion to keep itself balanced; to minimize splits, it can instead perform rotation-like operations that shift records to a sibling page.

The number of levels is generally kept at 2-4. A mechanical disk can perform at least 100 random IOs per second, so a lookup takes approximately 0.02-0.04 seconds.
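The latency estimate follows directly from the arithmetic: one page read per tree level, at roughly 10 ms per random IO on a 100-IOPS mechanical disk:

```python
IOPS = 100                 # mechanical disk, random reads per second
seconds_per_io = 1 / IOPS  # 0.01 s per page read

for levels in (2, 3, 4):
    latency = levels * seconds_per_io   # one IO per B+ tree level
    print(f"{levels} levels -> {latency:.2f} s")
# prints 0.02 s for 2 levels up to 0.04 s for 4 levels
```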

The difference between clustered and secondary indexes lies in whether the leaf nodes contain complete row information

Clustered Index

Leaf nodes contain the complete row records of the table; these leaf nodes are called data pages, and the data pages are linked together through a doubly linked list.

Secondary Index

Leaf nodes contain, besides the key values, a pointer to the row's primary key value. Therefore, the number of IOs needed to reach a data page through a secondary index is the height of the secondary index plus the height of the clustered index.

Heap table: rows are stored in insertion order. In that case the pointer can be a row identifier, enabling faster location of the row data.

B+ tree indexes are suitable when queries access only a small fraction of the table; only high-selectivity fields are worth indexing. Selectivity is gauged through the Cardinality value, an estimate of the number of unique records in the index.
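Selectivity can be estimated as Cardinality divided by the total row count; values near 1 mark good index candidates. A small sketch (the column examples and thresholds are illustrative):

```python
def selectivity(cardinality, total_rows):
    """Cardinality / total rows: near 1.0 means almost-unique values
    (a good B+ tree index candidate); near 0 means few distinct values
    (an index would rarely narrow the search)."""
    return cardinality / total_rows

# e.g. an email column: almost every value unique -> index it
assert selectivity(99_000, 100_000) > 0.9
# e.g. a gender column: two distinct values -> not worth indexing
assert selectivity(2, 100_000) < 0.001
```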

Composite Index

Indexes multiple fields at once; still a B+ tree. Leaf-node key tuples are sorted, so for an index stored as (a, b), a query with where a = xx can use the index, but where b = xx cannot, because the b values are unordered overall; the second column is ordered only when the first column is fixed.
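The ordering argument can be demonstrated with sorted key tuples: the leaf entries of an (a, b) index are sorted by a first, so a lookup on a is a contiguous range, while a lookup on b alone must examine every entry (illustrative sketch):

```python
import bisect

# Leaf-node key tuples of a composite index on (a, b): sorted by a, then b
keys = sorted([(1, 9), (1, 3), (2, 5), (2, 1), (3, 7)])
# -> [(1, 3), (1, 9), (2, 1), (2, 5), (3, 7)]

# WHERE a = 2: a contiguous slice, found with binary search
lo = bisect.bisect_left(keys, (2,))
hi = bisect.bisect_left(keys, (3,))
assert keys[lo:hi] == [(2, 1), (2, 5)]

# WHERE b = 5: the b values (3, 9, 1, 5, 7) are unordered overall,
# so every leaf entry must be examined
matches = [k for k in keys if k[1] == 5]
assert matches == [(2, 5)]
```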

Covering Index

Obtains the query's records from the secondary index alone, without touching the clustered index. Since a secondary index is much smaller than the clustered index, this can eliminate a large number of IO operations.

Why optimizer doesn’t choose index?

This often occurs with range lookups and JOIN scenarios: when a large fraction of the table must be fetched, sequentially scanning the clustered index can be cheaper than performing many random bookmark lookups through the secondary index.

Adaptive Hash Index (AHI)

Suitable for equality lookups. When the engine observes that a hash index would bring a speed improvement, it creates one automatically for certain hot pages, based on access frequency and patterns.

Conditions:

  1. The successive accesses to the page must follow the same pattern (e.g. the same query condition)
  2. The page has been accessed 100 times in this pattern
  3. The page has been accessed N times through this pattern, where N = number of records on the page / 16

Inverted Index

Used for full-text search; a red-black tree serves as the full-text index cache, which functions similarly to the insert buffer.

Locks

latch: a lightweight lock, held for a very short time, with no deadlock detection or handling; freedom from deadlock is guaranteed only by the locking order. Lives inside each data-structure object; implemented as mutexes and read-write locks.

lock: its object is the transaction; used to lock database objects such as tables, pages, and rows. Generally released only at transaction commit or rollback, tied to the transaction isolation level, with deadlock detection and handling. Lives in the lock manager's hash table.

Row-level Lock

Shared lock (S Lock): allows a transaction to read a row of data. Exclusive lock (X Lock): allows a transaction to delete or update a row of data. Only S locks are compatible with each other. InnoDB supports multi-granularity locking, allowing row-level and table-level locks to exist simultaneously within a transaction.

Intention lock: divides the locked object into multiple levels, signaling that the transaction wishes to lock at a finer granularity. Intention shared lock (IS Lock): the transaction wants to obtain shared locks on several rows in a table. Intention exclusive lock (IX Lock): the transaction wants to obtain exclusive locks on several rows in a table.

Lock hierarchy: database → tables → pages → records

To add an X lock to record 1, intention exclusive locks must first be added to all of its ancestors (page, table, database). If any incompatible lock already exists along the way, the transaction must wait.
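The table-level compatibility rules behind this waiting can be written as a small matrix (a sketch of the standard multi-granularity locking rules):

```python
# Table-level lock compatibility: True = the two locks can coexist
COMPAT = {
    ("IS", "IS"): True,  ("IS", "IX"): True,
    ("IS", "S"):  True,  ("IS", "X"):  False,
    ("IX", "IX"): True,  ("IX", "S"):  False, ("IX", "X"): False,
    ("S",  "S"):  True,  ("S",  "X"):  False,
    ("X",  "X"):  False,
}

def compatible(a, b):
    """Symmetric lookup: try (a, b), fall back to (b, a)."""
    return COMPAT[(a, b)] if (a, b) in COMPAT else COMPAT[(b, a)]

# Locking a row with X requires IX on the table; another transaction
# merely intending to read rows (IS) does not conflict, but a full
# table-level S lock does:
assert compatible("IX", "IS")
assert not compatible("IX", "S")
```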

Consistent Non-locking Read

Reads the row data as of the current execution time through multi-version concurrency control. If the row is being updated or deleted, the read does not wait for the row lock to be released; instead it reads snapshot data directly.

Under the transaction isolation levels READ COMMITTED and REPEATABLE READ, non-locking consistent read is the default: the former always reads the latest version of the snapshot data, while the latter always reads the row version as of the start of the transaction.

Consistent Locking Read

In some cases we need strong logical consistency, so the statement level supports consistent locking reads: SELECT ... FOR UPDATE (adds an X lock) and SELECT ... LOCK IN SHARE MODE (adds an S lock; another transaction trying to add an X lock will block). Both must be used within a transaction.

Auto-increment

AUTO_INCREMENT columns use a lock on the auto-increment counter, a special table-level lock that is released immediately after the auto-increment insert completes rather than at transaction end; newer versions implement it with a lightweight mutex.

Foreign Keys

InnoDB automatically adds an index to foreign key columns, which avoids table locks. Inserting or updating a foreign key value first queries the parent table's records, so that read must be a locking read to prevent data inconsistency.

Row Locks

  1. Record Lock: Lock on single row record
  2. Gap Lock: Locks a range but does not include the record itself
  3. Next-key Lock: Combination of previous two, locks a range and the record itself

For row queries, InnoDB generally uses the third type (Next-key Lock).
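The three lock types can be pictured as intervals over index key values (an illustrative sketch, not InnoDB's internal representation):

```python
class RecordLock:
    """Locks a single index record."""
    def __init__(self, key): self.key = key
    def covers(self, v): return v == self.key

class GapLock:
    """Locks the open interval (low, high): the gap, not the records."""
    def __init__(self, low, high): self.low, self.high = low, high
    def covers(self, v): return self.low < v < self.high

class NextKeyLock:
    """Gap lock plus the record itself: the half-open interval (low, high]."""
    def __init__(self, low, high): self.low, self.high = low, high
    def covers(self, v): return self.low < v <= self.high

# With index records ... 10, 20 ..., a next-key lock on 20 covers (10, 20]
nk = NextKeyLock(10, 20)
assert nk.covers(15) and nk.covers(20)   # blocks inserts into the gap too
assert not GapLock(10, 20).covers(20)    # a gap lock excludes the record
assert RecordLock(20).covers(20)
```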

Phantom Read Problem

Refers to the same transaction executing the same SQL statement twice in a row and possibly getting different results: the second execution may return rows that did not exist before. This violates transaction isolation, since one transaction can see another transaction's results.

Solved using Next-key Locking: not only the matching values but also the query range is locked (an X lock on the range), so modifications within that range are not allowed. Adopted under the REPEATABLE READ transaction isolation level.

Dirty Read

Dirty data: row records modified in the buffer pool by a transaction that has not yet committed. A dirty read means reading such uncommitted modified data; when another transaction reads it, database isolation is violated. Occurrence condition: transaction isolation level READ UNCOMMITTED. Application scenario: replica (slave) nodes, where queries do not require perfectly accurate results.

Non-repeatable Read

Difference from a dirty read: a dirty read reads uncommitted data, while a non-repeatable read reads committed data. Non-repeatable reads are generally acceptable since the data has been committed. Level: allowed under READ COMMITTED.

Lost Update

One transaction's update is overwritten by another transaction's update, causing data inconsistency. Solution: serialize the transactions' operations.

Deadlock

Brute-force method: roll back and restart any transaction that has to wait; simple but poor performance.

Timeout mechanism: when the waiting time exceeds a threshold, one transaction is rolled back and the other waiting transaction continues. Wait-for graph: deadlock detection via DFS; more proactive, as the existence of a cycle indicates a deadlock, in which case the transaction with the least undo volume is chosen for rollback.
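Cycle detection on the wait-for graph is a standard DFS; a sketch (transaction names hypothetical):

```python
def has_cycle(wait_for):
    """DFS over the wait-for graph (txn -> txns it waits on).
    A back edge to a node on the current DFS path means a cycle,
    i.e. a deadlock."""
    WHITE, GRAY, BLACK = 0, 1, 2        # unvisited / on path / done
    color = {t: WHITE for t in wait_for}

    def dfs(t):
        color[t] = GRAY
        for u in wait_for.get(t, ()):
            if color.get(u, WHITE) == GRAY:
                return True             # u is on the current path: cycle
            if color.get(u, WHITE) == WHITE and dfs(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and dfs(t) for t in wait_for)

# T1 waits on T2 and T2 waits on T1 -> deadlock
assert has_cycle({"T1": ["T2"], "T2": ["T1"]})
assert not has_cycle({"T1": ["T2"], "T2": []})
```

On detection, the engine would pick a victim (per the notes above, the transaction with the least undo volume) and roll it back to break the cycle.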

Lock Escalation

InnoDB has no lock escalation: it manages locks per page accessed by each transaction, using a bitmap, so the overhead stays constant regardless of how many records are locked.

Transactions

Either all of a transaction's modifications are saved, or none are.

  • A Atomicity
  • C Consistency: database integrity constraints are not violated before or after the transaction
  • I Isolation
  • D Durability: once a transaction commits, its results are permanent; data can be recovered even after a crash

Transaction types: flat transactions, flat transactions with savepoints, chained transactions, nested transactions, and distributed transactions (running across different nodes).

Implementation

  • redo: comprises the redo log buffer and redo log files; used to ensure durability; written sequentially
  • undo: supports transaction rollback and MVCC; requires random read/write

To ensure buffered log writes reach the log file, fsync is called each time, preventing content from lingering in the file-system buffer. Log files are stored in 512-byte blocks; records larger than a block are split. During recovery, only the log portion starting from the checkpoint is considered.
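The pad-to-block-and-fsync pattern can be sketched as follows (file path and function name hypothetical):

```python
import os, tempfile

BLOCK = 512  # redo log is written in 512-byte blocks

def append_log(path, payload: bytes):
    """Pad the record to whole 512-byte blocks, append it, and fsync so
    the data cannot linger in the file-system buffer cache."""
    padded = payload + b"\x00" * (-len(payload) % BLOCK)
    with open(path, "ab") as f:
        f.write(padded)
        f.flush()
        os.fsync(f.fileno())     # force the write through to disk
    return len(padded) // BLOCK

path = os.path.join(tempfile.mkdtemp(), "redo.log")
blocks = append_log(path, b"x" * 700)    # 700 bytes span 2 blocks
assert blocks == 2
assert os.path.getsize(path) == 2 * BLOCK
```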

Transaction Isolation Levels

  • READ UNCOMMITTED: browse-level protection only
  • READ COMMITTED: no phantom-read protection
  • REPEATABLE READ: solves phantom reads (InnoDB default)
  • SERIALIZABLE

The lower the isolation level, the fewer locks a transaction requests and the shorter it holds them.

Backup and Recovery

  • Hot backup: Online backup
  • Cold backup: Offline backup
  • Warm backup: Online backup with locks

Complete backup, incremental backup, log backup