openGauss database SQL engine
openGauss Database executor technology
openGauss Storage technology
One 、openGauss Storage overview
Two 、openGauss Row storage engine
Ⅰ、 Overall architecture of row storage engine
Ⅱ、 The basic model of row storage and page organization structure
Ⅲ、 Multi version management of row storage and DML operation
Ⅳ、 be based on CSN Of MVCC Mechanism
Ⅴ、 Reclaim row storage space
Ⅵ、 Shared cache management for row storage
Ⅶ、 Design of parallel log system
Ⅷ、 Persistence and fault recovery system design
3、 ... and 、openGauss Column storage engine
Four 、openGauss Memory engine
openGauss Transaction mechanism
openGauss database security
openGauss Storage technology
Two .openGauss Row storage engine
Shared cache management for row storage 06
Mentioned earlier , Row storage is a disk based storage engine . for fear of IO High cost of , The storage engine caches some pages in memory , It is easy to retrieve and change it at any time . The storage engine filters the cached pages 、 Replace and eliminate , Ensuring that the pages stored in the cache can improve the execution efficiency of the whole engine .
There are also many kinds of caches in row storage , In addition to the cache of normal data pages , There are also methods for caching meta information of various tables Relation Cache（ Data table cache ）, And a method for accelerating the operation of database system information and system tables Catalog Cache（ System table cache ）. These kinds of caches are in the form of page（ page ） The form of is managed by the shared buffer structure .
The shared buffer consists of a large number of page slots , The slot itself has a corresponding description structure , And page level locks for managing concurrent operations in this operation , And it is equipped with a free linked list for free space management . Pictured 19 Shown .
chart 19 Shared buffer
Read and write requests for transactions in the row storage engine , Will be passed to the shared buffer first . Requests for a page are now searched in the buffer , If you miss , Get an empty slot （ You may need to weed out pages that are not commonly used in the buffer ）, Then interact with the file system and read the required page into the slot , Lock and use . According to the characteristics and load of the business 、 And the size of the shared buffer , Data pages that are already in the buffer will be hit repeatedly , Avoid the connection with disk IO expenses , So as to speed up the whole transaction process .
Changes to the page are also placed in the cache and marked as dirty pages . At this time, the background write thread （background writer） Clean and swipe the dirty pages regularly , Return space to buffer . On the other hand ,Checkpoint, That is, checkpoint operations , All the pages will also be swiped in progress , Ensure data persistence . A concept that needs attention here is , When a transaction is committed , The pages changed during the execution of this transaction are not necessarily flushed to disk , The persistence mechanism of the transaction itself is actually forced by the transaction WAL, That is to say xlog, To ensure that the . stay Checkpoint After the operation , Because the relevant pages have been persisted to disk , therefore Checkpoint Before time point xlog, It can be recycled . This mechanism will continue in subsequent chapters .
Shared buffer is actually the core mechanism for coordinating management and scheduling in memory and persistent storage , It has a great impact on the efficiency of database management system . In order to further improve the hit rate of pages in the buffer , Some operations that may affect the relevance of pages in the buffer to the business , Will use a separate buffer ,Ring Buffer（ Ring buffer ）. Batch read 、 Batch write 、 as well as Vacuum Page cleanup , All belong to this kind of operation .
Design of parallel log system 07
The log system of database is very important , It is the key guarantee of data persistence . Traditional databases generally adopt the design of serial log brushing , Because logs have order dependencies , for example ： One is generated by transactions REDO/UNDO Logs are dependent on each other .openGauss The log system uses multiple LOG WRITER（ Log write thread ） The mechanism of thread parallel writing , Give full play to SSD Multichannel IO Ability . Pictured 20 Shown .
chart 20 Schematic diagram of parallel log brushing
Key design ：
§ Of the whole transaction wal The log cannot be split into multiple transaction log shared buffers , Must write to a transaction log shared buffer .
§ Fault recovery wal, Parallel recovery , Must follow LSN Size order recovery .
§ Before each transaction ends, you need to ensure the corresponding transaction log LSN The disc brushing has been completed .
§ Consider allocating transaction log shared buffer for transactions NUMA Architecture adaptation .
Persistence and fault recovery system design 08
The log system of database is very important , It is the key guarantee of data persistence . Based on transaction ID Based on multi version management and the accumulation and cleaning methods of historical versions , Row storage engine mainly uses Redo journal （ As mentioned above XLOG） As the main means of persistence , With incremental checkpoints （Checkpoint） And parallel playback of logs , Support fast fault recovery of database instances .
- The transaction Redo Logging mechanism
Redo Logs are generated when transactions modify data , It is used to record the modified data of the transaction 、 Or the specific operation of the transaction on the data . such as , ordinary INSERT/UPDATE/DELETE The operation will produce the following results 21 Shown Redo journal .
chart 21 Redo journal
Some key operations that are not directly modified by transactions will also be recorded in Redo journal , For example, apply for a new page 、 Explicit transaction commit 、 checkpoint （Checkpoint） etc. . Record Redo The principle of logging , After the database fails , You can start with the last checkpoint , adopt Redo Log playback , Restore to the same state as before the failure of the database instance .
Redo Logs should be used in addition to data recovery , Data backup 、 Restore and synchronization between active and standby database instances 、 Different database instances / Synchronization between clusters depends on Redo Log mechanism . In order to ensure the consistency of data , Before swiping the relevant pages of transaction modification , You need to put the corresponding Redo Brush the log disk , That is to say WAL（Write Ahead Log） Principles .
Because the submission of transactions and the order between operations are crucial to data consistency , therefore Redo The log must also record this order . Every one of them Redo Each log has a log number , namely Log Sequence Number (LSN). In a row storage system ,LSN For an incremental 64 Bit unsigned integer . Various mechanisms in the system , As mentioned later, checkpoints , And the synchronization mechanism between active and standby instances 、 Arbitration mechanism , All need to rely on the LSN Or recovered LSN As an important mark or judgment basis .
- Full and incremental checkpoints
In the above description of transaction log and shared buffer , There is a key message , That is, the persistence of transaction log is synchronized with transaction commit , However, the persistence of page related changes within a transaction is not synchronized with the transaction submission ; in other words , Transaction submission needs to be related to this thing Redo Logs are forced to be flushed , However, it is not mandatory that the relevant pages are also forced to be swiped . When a database instance fails and restarts , The instance is in the process of starting , Changes that have not been able to swipe the disk in time before need to be recovered using the transaction log . But the cost of log playback is very high , The performance is also relatively slow . In order to avoid the need to restore the transaction log from scratch every time the database , The database itself periodically creates checkpoints , You can also manually create checkpoints through commands .
In the process of creating checkpoints , The storage engine will write dirty pages in the data buffer to disk , And record log files and control files . Record information rec LSN It means that in this checkpoint , Here it is LSN All changes corresponding to previous logs have been persisted , The next data recovery can be done directly from here LSN Start ; At the same time LSN Previous transaction logs , For other purposes （ The active and standby instances are synchronized 、 Data backup, etc ） when , It can also be recycled .
Because the checkpoint itself needs to brush all dirty pages in the buffer （ Full checkpoint ）, Therefore, each checkpoint will introduce a lot of changes to the physical environment where the database instance is located from the perspective of performance IO, The peak of disk often means the fluctuation of performance . At the same time, because there are a lot of IO expenses , Therefore, the management of checkpoints should not be too frequent ,rec LSN Slow propulsion , Then there will be more problems when restarting the database Redo Logs need to be played back , There is a problem that the restart recovery time is too long . To solve this problem , The row storage engine introduces the concept of incremental checkpoints .
Incremental checkpoint mechanism , Will maintain a dirty page queue （dirty page queue）. Dirty pages are according to LSN Put in the queue in ascending order , Periodically, a background thread dedicated to cleaning dirty pages pagewriter（ Page disk brushing thread ） Carry out regular and quantitative operation of brushing dirty pages and footwall . Pictured 22 Shown .
chart 22 Dirty page queue
Maintain a... In the queue rec LSN, Record the... Corresponding to the dirty page that has been scrubbed LSN size , That is, the transaction corresponding to the dirty page in the queue is committed 、 Its relative transaction log after the footwall , this rec LSN The tag will be updated . When an incremental checkpoint is triggered , You don't have to wait for the dirty page brush disk , Instead, you can use the current dirty page queue rec LSN As a checkpoint rec LSN Record . The incremental Checkpoint The existence of makes the whole system IO Smoother , And the fault recovery time of the system is shorter , Higher availability .
- Parallel playback
Redo Log playback means that Redo Changes recorded in the log are reapplied to the system / The process in the page , This process usually occurs on the standby instance during instance failure recovery or data synchronization between the primary and standby instances （ That is, the change of the main instance , The standby instance also needs to be played back , To achieve the effect consistent with the state of the main instance ）. At present, the physical instance where the database is located often has more CPU nucleus , However, log playback is often operated by a single thread , In the process of log playback, the database instance cannot make full use of the physical environment resources .
In order to make full use of it CPU The characteristics of multi-core , Significantly speed up the recovery of database exceptions and the log playback of standby instances , The row storage engine plays back logs in a multi-threaded parallel manner , Pictured 23 As shown in Fig .
chart 23 Multi thread parallel playback log
The design of the whole parallel playback system adopts producer - Consumer model , The allocation module is responsible for parsing 、 Assign log to playback module , The playback module is responsible for consumption 、 Playback log .
To achieve this design , In the implementation, a lock free with blocking function is adopted SPSC(Single Producer Single Consumer) queue . The allocation thread, as the producer, puts the parsed log into the queue of playback threads , The playback thread plays back the consumption log from the queue . Pictured 24 Shown .
chart 24 unlocked SPSC queue
In order to improve the reliability of the overall parallel playback mechanism , In the playback action of a page , For... In the transaction log LSN And... In the page structure last_LSN（ See... Described in the previous section for details HeapPageHeader（ Heap page header ） Structure ） check , To ensure the consistency of the database system during playback .
To be continued .......