MySQL Technical Internals Reading Notes
Database: a collection of files on the file system
Instance: a process composed of background threads and a shared memory area
Architecture: single process, multi-threaded
Configuration reading: the last configuration file read takes precedence
Feature: pluggable storage engines
Various Storage Engines
InnoDB
Row-level locking design; supports foreign keys; by default, read operations do not take locks. MVCC (Multi-Version Concurrency Control) achieves high concurrency
MyISAM
Does not support transactions; table-level locking design; supports full-text indexing. Its buffer pool caches only index files, not data files
Connecting to Database
Connecting is essentially inter-process communication between the client process and the database instance, hence the supported methods: TCP/IP, pipes, named pipes, UNIX domain sockets, etc.
When connecting via TCP, the database first checks the privilege view for authentication; the table is named 'user'
InnoDB Storage Engine
Architecture: multiple background threads + a memory pool + disk files
The background threads mainly refresh data in the memory pool, ensuring the buffer pool caches the most recent data, and flush modified data back to disk files
- Master Thread: asynchronously flushes data to disk, ensuring consistency
- IO Thread: handles the callbacks of asynchronous IO requests
- Purge Thread: reclaims undo pages no longer needed after transactions commit
- Page Cleaner Thread: moves dirty-page flushing into its own thread
The buffer pool is comparable to the kernel page cache; pages are flushed back to disk via the checkpoint mechanism. Page types in the pool: index pages, data pages, undo pages, insert buffer, adaptive hash index, lock information, data dictionary information
Multiple buffer pool instances can exist; each page is assigned to an instance by hash value to reduce internal resource contention. The default page size is 16KB, and pages are managed with an LRU list
Full table scans may cause LRU list pollution
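InnoDB's countermeasure to scan pollution is midpoint insertion: a newly read page enters the middle of the LRU list and is promoted to the hot end only on re-access, so a one-off full scan churns only the cold half. A minimal Python sketch of the idea; the 5/8 split mirrors InnoDB's default midpoint, but the promotion rules here are simplified:

```python
class MidpointLRU:
    """Toy midpoint-insertion LRU: first-touch pages enter the 'old'
    sublist; only a second access promotes them to the 'new' (hot)
    sublist. A full scan therefore evicts only cold pages."""

    def __init__(self, capacity, new_frac=0.625):
        self.capacity = capacity
        self.new_cap = max(1, int(capacity * new_frac))  # ~5/8 of the list
        self.new = []   # hot sublist, index 0 = most recently promoted
        self.old = []   # cold sublist, index 0 = most recently inserted

    def access(self, page):
        if page in self.new:
            self.new.remove(page)
            self.new.insert(0, page)               # refresh position
        elif page in self.old:
            self.old.remove(page)                  # second touch: promote
            self.new.insert(0, page)
            if len(self.new) > self.new_cap:
                self.old.insert(0, self.new.pop()) # demote overflow
        else:
            self.old.insert(0, page)               # first touch: midpoint
            if len(self.new) + len(self.old) > self.capacity:
                (self.old or self.new).pop()       # evict the coldest page
```

Re-accessed "hot" pages survive a burst of single-touch scan pages, because the scan pages never leave the old sublist.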
Checkpoint
Used to solve:
- Shorten database recovery time
- When buffer pool is insufficient, flush dirty pages back to disk
- When redo log is unavailable, flush dirty pages
When the database crashes, it does not need to replay all logs, only the redo log written after the last checkpoint
When dirty pages accumulate and free buffer pool pages run short, a checkpoint is forced to flush dirty pages back to disk
When the redo log needs to be reused, a checkpoint is triggered
Two types:
- Sharp Checkpoint: flushes all dirty pages back to disk when the database shuts down
- Fuzzy Checkpoint: flushes only part of the dirty pages; the whole process is asynchronous. Disk IO load is monitored, and flushing is performed when the load is low
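The trigger conditions above condense into a small decision function. A toy sketch; the threshold values are illustrative, not InnoDB's actual parameters (which involve settings like innodb_max_dirty_pages_pct and redo log free-space margins):

```python
def checkpoint_needed(dirty_ratio, redo_used_ratio,
                      dirty_limit=0.75, redo_limit=0.75):
    """Toy fuzzy-checkpoint trigger: fire when dirty pages pile up
    (buffer pool pressure) or when the circular redo log is nearly
    full and must be reused. Thresholds are illustrative only."""
    if dirty_ratio >= dirty_limit:
        return "flush dirty pages"
    if redo_used_ratio >= redo_limit:
        return "advance checkpoint to reuse redo log"
    return None   # no checkpoint pressure
```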
Insert Buffer
Inserts into the clustered index (primary key) are generally sequential (auto-increment), requiring no random disk reads, hence fast. Other, secondary indexes are non-clustered; because of the B+ tree structure, inserting into them requires random reads
For non-clustered index insertion:
- Check whether the target non-clustered index page is in the buffer pool; if so, insert directly
- If not, store the write request first, then at some frequency merge the insert buffer with the secondary index leaf pages, usually combining multiple insertions into one operation (within a single index page), which greatly improves insert performance
Similarly, there are corresponding buffers for deletion and modification operations
Implemented internally as a B+ tree stored in the shared tablespace; a single tree handles insert buffering for the secondary indexes of all tables
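The merge step can be illustrated with a toy model: buffered records are grouped by their target page, so applying them costs one operation per page rather than one per record. Function names are hypothetical; this sketches the idea, not InnoDB's actual ibuf B+ tree:

```python
from collections import defaultdict

def buffered_insert(insert_buffer, page_no, record):
    """Queue a secondary-index insert instead of touching the page now."""
    insert_buffer[page_no].append(record)

def merge(insert_buffer, pages):
    """Apply all buffered records page by page: N random writes to the
    same page collapse into a single merged operation."""
    operations = 0
    for page_no, records in insert_buffer.items():
        pages.setdefault(page_no, []).extend(sorted(records))
        operations += 1            # one IO per page, not per record
    insert_buffer.clear()
    return operations
```

Six buffered inserts spread over two pages merge into just two page operations.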
Double Write
Doublewrite writes a copy of each page before writing the page in place, preventing data loss from partial page writes. When a write failure occurs, the page is first restored from its copy, then redo is applied
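A toy model of that recovery path: every page write goes to a copy area first, so a torn in-place write (detected by checksum mismatch) can be repaired from the copy before redo is applied. The crash simulation and names are illustrative:

```python
import zlib

def checksum(data):
    return zlib.crc32(data.encode())

def write_page(storage, doublewrite, page_no, data, crash=False):
    """Doublewrite sketch: the full page goes to the doublewrite area
    first, then in place. A simulated crash tears the in-place write,
    leaving only half the page behind."""
    doublewrite[page_no] = (data, checksum(data))
    written = data[: len(data) // 2] if crash else data  # torn on crash
    storage[page_no] = (written, checksum(data))  # checksum of full page

def recover(storage, doublewrite):
    """On restart, restore any torn page (checksum mismatch) from its
    doublewrite copy; redo would then be replayed on top."""
    for page_no, (copy, crc) in doublewrite.items():
        data, expected = storage.get(page_no, ("", None))
        if checksum(data) != expected:
            storage[page_no] = (copy, crc)
```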
Asynchronous IO
The biggest advantage is IO merging, which improves IO performance. Supported natively by the kernel. Dirty-page flushing, disk writes, and read-ahead are all done through AIO
Indexes
iostat showing 100% disk usage may indicate excessive indexes have been added
B+ Tree Index
This type of index cannot find a specific row for a given key value; it can only find the page containing the searched row. The database then reads that page into memory and searches within memory
Maintaining balance during insertion would require extensive page splits, which are expensive, so the B+ tree performs rotation-like operations (moving records to a sibling page) to minimize page splits as much as possible
Number of levels generally maintained at 2-4 layers, mechanical disks can perform at least 100 IOs per second, meaning query time is approximately 0.02 - 0.04 seconds
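The quoted figure is simple arithmetic: one disk IO per tree level, at roughly 100 IOPS for a mechanical disk:

```python
def point_query_seconds(tree_height, ios_per_second=100):
    """One disk IO per B+ tree level: a 2-4 level tree on a ~100 IOPS
    mechanical disk yields the 0.02-0.04 s range from the notes."""
    return tree_height / ios_per_second
```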
The difference between clustered and secondary indexes lies in whether the leaf nodes contain complete row information
Clustered Index
Leaf nodes contain the complete row records of the table and are called data pages; data pages are linked through a doubly linked list
Secondary Index
Leaf nodes contain, besides the key values, a bookmark pointing to the row: its primary key value. Therefore the number of IOs for a secondary index lookup to reach the data page is the secondary index height plus the clustered index height
Heap table: rows are stored in insertion order; the bookmark can then be a physical row identifier, allowing the row data to be located faster
B+ tree indexes are applicable when a query accesses only a small part of the table; only high-selectivity fields are worth indexing. Selectivity is judged from the Cardinality value, an estimate of the number of unique records in the index
Composite Index
Indexes multiple fields at once; also a B+ tree. Leaf-node key tuples are sorted: for an index stored as (a, b), where a = xx can use the index, but where b = xx cannot, because b alone is unordered; the second column is ordered only within a fixed value of the first column
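The leftmost-prefix rule can be demonstrated with sorted tuples, which is how composite-index leaf entries are ordered: entries matching a = 2 sit contiguously (a range scan suffices), while matches on b alone are scattered and force a full scan. Toy data:

```python
import bisect

# Leaf entries of a composite index on (a, b), kept in tuple order:
entries = sorted([(1, "b"), (2, "a"), (1, "a"), (2, "c"), (3, "a")])
# -> [(1,'a'), (1,'b'), (2,'a'), (2,'c'), (3,'a')]

# WHERE a = 2: all matches are contiguous, so a range scan works.
lo = bisect.bisect_left(entries, (2,))
hi = bisect.bisect_left(entries, (3,))
a_matches = entries[lo:hi]

# WHERE b = 'a': the b values alone read 'a','b','a','c','a' --
# unordered, so only a scan of every entry finds the matches.
b_matches = [e for e in entries if e[1] == "a"]
```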
Covering Index
Query records are obtained from the secondary index alone, without reading the clustered index; since the secondary index is much smaller than the clustered index, this can save a large amount of IO
Why optimizer doesn’t choose index?
Often occurs with range lookups, JOIN table scenarios
Adaptive Hash Index (AHI)
Suited to equality lookups. The engine observes that a hash index would bring a speedup and then creates one, automatically building hash indexes for certain hot pages based on access frequency and pattern
Conditions:
- Consecutive accesses to the page must follow the same pattern
- The page has been accessed 100 times in this pattern
- The page has been accessed N times through this pattern
Inverted Index
Used for full-text search; uses a red-black tree as the full-text index cache, which functions similarly to the insert buffer
Locks
latch: a lightweight lock held for a very short time, with no deadlock detection or handling (freedom from deadlock is ensured only through locking order); exists inside each data-structure object. Implemented as mutexes and read-write locks
lock: its object is the transaction; used to lock database objects such as tables, pages, and rows. Generally released only at transaction commit or rollback, depending on the isolation level; has deadlock detection and handling. Lives in the lock manager's hash table
Row-level Lock
Shared lock (S): allows a transaction to read a row
Exclusive lock (X): allows a transaction to delete or update a row
Only S locks are compatible with each other
InnoDB supports multi-granularity locking: row-level and table-level locks can exist simultaneously within one transaction
Intention locks divide the locked object into multiple levels and signal that a transaction wishes to lock at a finer granularity
Intention Shared lock (IS): the transaction wants shared locks on some rows of the table
Intention Exclusive lock (IX): the transaction wants exclusive locks on some rows of the table
Lock hierarchy: Database | Table1, Table2 … | Page1,2… | Record 1,2….
To place an X lock on record 1, intention exclusive locks must first be placed on all of its ancestors; if any incompatible lock is already held there, the request must wait
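The IS/IX/S/X interaction is just a compatibility-matrix lookup: a new table-level lock is granted only if it is compatible with every lock already held by other transactions. A sketch using the standard compatibility matrix:

```python
# Table-level lock compatibility (True = compatible). Intention locks
# (IS/IX) conflict only with actual S/X table locks, never each other.
COMPAT = {
    ("IS", "IS"): True,  ("IS", "IX"): True,  ("IS", "S"): True,  ("IS", "X"): False,
    ("IX", "IS"): True,  ("IX", "IX"): True,  ("IX", "S"): False, ("IX", "X"): False,
    ("S",  "IS"): True,  ("S",  "IX"): False, ("S",  "S"): True,  ("S",  "X"): False,
    ("X",  "IS"): False, ("X",  "IX"): False, ("X",  "S"): False, ("X",  "X"): False,
}

def can_grant(requested, held_locks):
    """Grant a table lock only if it is compatible with every lock
    already held; otherwise the requester must wait."""
    return all(COMPAT[(requested, h)] for h in held_locks)
```

Two transactions can thus hold IX on the same table simultaneously (they intend to lock different rows), while a table-level S lock blocks any IX holder.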
Consistent Non-locking Read
Reads the row data as of the current point in time via multi-version control. If the row is being updated or deleted, the read does not wait for the row lock to be released; it reads snapshot data directly
Under the transaction isolation levels READ COMMITTED and REPEATABLE READ, reads default to non-locking consistent reads; the former always reads the latest committed snapshot of the row, the latter always reads the row version as of the start of the transaction
Consistent Locking Read
In some cases strong logical consistency is needed; locking reads are supported at the statement level:
SELECT … FOR UPDATE (adds an X lock)
SELECT … LOCK IN SHARE MODE (adds an S lock; another transaction requesting an X lock will block)
Must be used within a transaction
Auto-increment
AUTO_INCREMENT columns are protected by a lock on the auto-increment counter, a special table lock that is released immediately after the auto-increment insertion completes rather than at transaction end; newer versions use a lightweight mutex implementation
Foreign Keys
InnoDB automatically adds an index on foreign key columns, which avoids table locks. Inserting or updating a foreign key value first queries the parent table's records, and this read must be a locking read to prevent data inconsistency
Row Locks
- Record Lock: Lock on single row record
- Gap Lock: Locks a range but does not include the record itself
- Next-key Lock: Combination of previous two, locks a range and the record itself
For row queries, the third type (Next-key Lock) is generally used
Phantom Read Problem
Refers to the same transaction executing the same SQL statement twice in a row and possibly getting different results; the second execution may return rows that did not exist before. This violates transaction isolation: one transaction can see another transaction's results
Solved using Next-Key Locking: not only the matching values but also the query range is locked (an X lock on the range), so modifications within that range are not allowed, which solves the problem. Adopted under the REPEATABLE READ isolation level
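The effect can be sketched for an equality query on a non-unique index: the lock covers the range up to the matched key plus the gap after it, so any insert that would create a phantom lands inside a locked range. A simplified model of the idea, not InnoDB's actual lock bookkeeping:

```python
import bisect

def next_key_ranges(keys, value):
    """Toy next-key lock for `WHERE col = value` on a non-unique index
    over the sorted existing `keys`: lock the next-key range
    (prev, value] plus the gap (value, next), so no phantom row can be
    inserted adjacent to the match."""
    i = bisect.bisect_left(keys, value)
    prev_key = keys[i - 1] if i > 0 else float("-inf")
    next_key = keys[i + 1] if i + 1 < len(keys) else float("inf")
    return (prev_key, value), (value, next_key)

def insert_blocked(ranges, new_value):
    """An INSERT is blocked if its value falls in any locked range."""
    (a, b), (_, c) = ranges
    return a < new_value <= b or b < new_value < c
```

With existing keys [10, 20, 30] and a query for 20, inserts of 15 or 25 block (they would appear as phantoms on a re-read), while 5 or 35 proceed.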
Dirty Read
Dirty data: row records modified in the buffer pool but not yet committed, i.e. uncommitted modified data
If another transaction reads uncommitted data, database isolation is violated
Occurrence condition: the isolation level is READ UNCOMMITTED
Application scenario: replica (slave) nodes where queries do not require particularly accurate results
Non-repeatable Read
Difference from a dirty read: a dirty read reads uncommitted data, a non-repeatable read reads committed data. Non-repeatable reads are generally acceptable, since the data read has been committed. Level: allowed under READ COMMITTED
Lost Update
One transaction's update is overwritten by another transaction's update, leaving the data inconsistent. Solution: serialize the conflicting transaction operations
Deadlock
Brute force: roll back and restart any transaction that has to wait; correct but poor performance
Timeout mechanism: when the waiting time exceeds a threshold, one transaction is rolled back and the other waiting transaction continues
Wait-for graph: deadlock detection via DFS, more proactive; a cycle in the graph means a deadlock has been detected, and the transaction with the least undo volume is chosen for rollback
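The wait-for graph check is plain DFS cycle detection over edges "transaction T waits for transaction U". A minimal sketch:

```python
def find_deadlock(wait_for):
    """DFS over the wait-for graph (txn -> list of txns it waits on).
    Any edge back to a node on the current DFS path is a cycle,
    i.e. a deadlock."""
    def dfs(node, path, visited):
        visited.add(node)
        path.add(node)
        for nxt in wait_for.get(node, []):
            if nxt in path:                 # back edge: cycle found
                return True
            if nxt not in visited and dfs(nxt, path, visited):
                return True
        path.discard(node)                  # backtrack off current path
        return False

    visited = set()
    return any(dfs(n, set(), visited) for n in wait_for if n not in visited)
```

In a real engine, detecting the cycle would be followed by picking the victim with the least undo to roll back, as the notes describe.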
Lock Escalation
InnoDB has no lock escalation: it manages locks per page accessed by each transaction, using a bitmap, so the overhead stays the same regardless of how many row locks are held
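The constant overhead follows from the representation: one bit per row slot on the page, so a lock record's size depends on the page, not on how many of its rows are locked. A toy model (slot count and layout are illustrative):

```python
class PageLockBitmap:
    """Per-(transaction, page) lock record: one bit per heap slot on
    the page. Locking more rows flips more bits but allocates no extra
    memory, which is why no lock escalation is needed."""

    def __init__(self, slots_per_page=128):
        self.slots = slots_per_page
        self.bits = 0                      # bit i set => slot i locked

    def lock_row(self, slot):
        self.bits |= 1 << slot

    def is_locked(self, slot):
        return bool(self.bits >> slot & 1)

    def memory_bytes(self):
        return self.slots // 8             # fixed, independent of locks held
```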
Transactions
Either all modifications are saved, or none are
A Atomicity
C Consistency: database integrity constraints are not violated before or after the transaction
I Isolation
D Durability: once a transaction commits, its results are permanent; data can be recovered even after a crash
Transaction types: flat transactions, flat transactions with savepoints, chained transactions, nested transactions, distributed transactions (run across different nodes)
Implementation
redo: consists of the redo log buffer and redo log files; ensures durability; written sequentially
undo: supports transaction rollback and MVCC; requires random read/write
To ensure buffered log records reach the log file, fsync is called each time, so content does not stay in the file system buffer. Log files are stored in 512-byte blocks; records exceeding a block are split. During recovery, only the log portion starting from the last checkpoint is considered
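The 512-byte block layout implies a simple split rule. Assuming a 12-byte block header and a 4-byte trailer (the sizes usually cited for InnoDB's log blocks; treat them as an assumption here), a record larger than the payload spans multiple blocks:

```python
BLOCK_SIZE = 512
HEADER = 12      # assumed per-block header size
TRAILER = 4      # assumed per-block trailer (checksum) size
PAYLOAD = BLOCK_SIZE - HEADER - TRAILER   # usable bytes per block

def blocks_for(record_bytes):
    """Number of 512-byte log blocks a redo record of the given size
    occupies: records larger than one payload are split across
    consecutive blocks."""
    return -(-record_bytes // PAYLOAD)    # ceiling division
```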
Transaction Isolation Levels
- READ UNCOMMITTED: browse protection (only for transactions)
- READ COMMITTED: no protection against phantom reads
- REPEATABLE READ: solves phantom reads
- SERIALIZABLE
The lower the isolation level, the fewer locks a transaction requests or the shorter it holds them
Backup and Recovery
- Hot backup: Online backup
- Cold backup: Offline backup
- Warm backup: Online backup with locks
Full backup, incremental backup, log backup