Tuesday, January 2, 2024

Efficiently adding columns and constraints in high-volume PostgreSQL databases

Introduction

Managing a PostgreSQL database with billions of rows requires careful consideration, especially when modifying the schema. A common task like adding a new column with a foreign key constraint can be daunting in a production environment with constant write operations. This article provides a step-by-step guide to efficiently perform such updates with minimal impact on your production database.

Background

Recently, a customer faced a challenge: they needed to add a column to several tables, each holding billions of rows, and link the new columns via foreign keys to a user table. The main concern was the performance impact of the operation, especially table locks and index creation.

Step-by-step solution

  1. Adding the new column: First, add the new column to each table. This requires an `ACCESS EXCLUSIVE` lock, but the lock is held only briefly: a new nullable column is a metadata-only change, initialized to NULL, so no table rewrite takes place.
  2. Creating the foreign key: Next, add the foreign key constraint with the `NOT VALID` option. This skips the full-table scan that would otherwise validate existing rows (all still NULL at this point); the constraint is nevertheless enforced for new inserts and updates, and validation is postponed to step 5.
  3. Updating the table: Backfill the new column with the required data. This is the time-consuming step: the `UPDATE` acquires a `ROW EXCLUSIVE` lock on the table and row-level locks on every row it modifies, preventing other transactions from updating or deleting those rows until it commits. Reading existing data and inserting new rows remain unaffected.
  4. Index creation: Once the backfill is complete, build the index with `CREATE INDEX CONCURRENTLY`. This does not block reads or writes on the table, although heavy write activity will slow the build down.
  5. Validating the foreign key: Finally, validate the constraint. Validation scans the table, but it requires only a `SHARE UPDATE EXCLUSIVE` lock, so normal reads and writes continue while it runs. The whole sequence is sketched in SQL after this list.
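
The sketch below walks through all five steps in SQL. The names involved (`orders`, `user_id`, `users`, the `orders_user_map` backfill source, and the constraint and index names) are illustrative assumptions, not the customer's actual schema; adapt them to your own tables and data source:

```sql
-- Step 1: add the column. Brief ACCESS EXCLUSIVE lock; metadata-only,
-- since the column is initialized to NULL and no rewrite is needed.
ALTER TABLE orders ADD COLUMN user_id bigint;

-- Step 2: add the foreign key as NOT VALID. Existing rows are not
-- scanned; the constraint still applies to new inserts and updates.
ALTER TABLE orders
    ADD CONSTRAINT orders_user_id_fkey
    FOREIGN KEY (user_id) REFERENCES users (id) NOT VALID;

-- Step 3: backfill the column. The data here comes from a hypothetical
-- mapping table; this is the slow step, and each updated row stays
-- row-locked until the transaction commits.
UPDATE orders o
SET    user_id = m.user_id
FROM   orders_user_map m
WHERE  o.id = m.order_id;

-- Step 4: build the index without blocking reads or writes.
-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY orders_user_id_idx ON orders (user_id);

-- Step 5: validate the constraint. Scans the table, but needs only a
-- SHARE UPDATE EXCLUSIVE lock, so normal traffic continues.
ALTER TABLE orders VALIDATE CONSTRAINT orders_user_id_fkey;
```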

Additional considerations

  • Disk space and vacuuming: Be aware that the backfill `UPDATE` leaves a dead tuple behind for every row it touches, so plan for the additional disk space this consumes. It's advisable to run `VACUUM ANALYZE` manually after completing the updates.
  • Autovacuum settings: Consider disabling autovacuum on the tables involved for the duration of these operations, so it doesn't interfere with them. Once done, manually `VACUUM ANALYZE` these tables and re-enable autovacuum, as sketched below.
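
A minimal sketch of that vacuum handling, again using the hypothetical `orders` table; `autovacuum_enabled` is a standard per-table storage parameter:

```sql
-- Disable autovacuum on the table for the duration of the operation.
ALTER TABLE orders SET (autovacuum_enabled = false);

-- ... perform the backfill, index creation, and validation here ...

-- Reclaim the dead tuples left behind by the UPDATE, refresh planner
-- statistics, then hand the table back to autovacuum.
VACUUM ANALYZE orders;
ALTER TABLE orders SET (autovacuum_enabled = true);
```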

Performance impact and testing

Using this method, the impact on production should be minimal. However, it’s crucial to test this process in a comparable test environment. Such testing not only ensures the smooth execution of each step but also provides an estimate of the time required.

Conclusion

Modifying large tables in a production PostgreSQL database requires careful planning and execution. By understanding the types of locks required and strategically managing the process, you can minimize the impact on database performance. Always test in a controlled environment to anticipate challenges and ensure a smooth transition in your production environment.
