Updating Millions of Rows in PostgreSQL with Parallel Database Clusters
For a large database transformation and migration project back in November, the requirement arose to populate a column in 100+ million records with values looked up from several other tables across different schemas.
I wrote a PL/pgSQL function to perform the lookup and, since each value would be reused across a batch of 100 or so rows, cache the result of each lookup in a dedicated table.
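The cache table itself is nothing more than a keyed lookup. A minimal sketch (the exact definition here is an assumption, based on the columns the function uses):
-- Simple lookup cache: one resolved label per user.
CREATE TABLE cache_table (
    user_id        int PRIMARY KEY,
    document_label text
);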
The function looked something like this:
-- Example function, not the real code...
CREATE FUNCTION get_document_title(userid int)
RETURNS TABLE(user_id int, document_label text)
LANGUAGE plpgsql
AS $$
DECLARE
  cachecount bigint;
  result record;
BEGIN
  --
  -- Check the cache first
  SELECT COUNT(*) INTO cachecount
  FROM cache_table c
  WHERE c.user_id = userid;
  IF cachecount > 0
  THEN
    --
    -- Result has been cached
    RETURN QUERY
    SELECT c.user_id, c.document_label
    FROM cache_table c
    WHERE c.user_id = userid
    LIMIT 1;
  ELSE
    --
    -- Perform the lookup.
    -- In the real code, there are several branches for looking up from
    -- different sources (plus a filter on document type), but for this
    -- demo I've simplified:
    SELECT * INTO result
    FROM documents d
    WHERE d.user_id = userid
    AND d.project_id IN (
      SELECT id
      FROM projects
      WHERE archived IS NOT TRUE
    )
    LIMIT 1;
    --
    -- Cache the result
    INSERT INTO cache_table (user_id, document_label)
    VALUES (userid, result.document_label);
    --
    -- Return the result
    RETURN QUERY
    SELECT userid, result.document_label;
  END IF;
END;
$$;
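Once defined, the function can be exercised on its own; the user id here is just a made-up example:
-- The first call performs the lookup and populates cache_table;
-- the second call returns the cached row.
SELECT * FROM get_document_title(42);
SELECT * FROM get_document_title(42);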
However, with over 100 million rows to process, the resulting query was projected to take days, if not weeks, of processing time, even on a fairly large Aurora PostgreSQL cluster.
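One rough way to arrive at that kind of projection is to time the lookup over a small sample and extrapolate to the full row count; something along these lines (the sample size is arbitrary, and note the run warms the cache as a side effect):
EXPLAIN ANALYZE
SELECT count(*)
FROM (SELECT user_id FROM partition_table_2015 LIMIT 10000) AS sample,
     LATERAL get_document_title(sample.user_id) AS t;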
Whilst on a walk, and not thinking about the problem at all, the idea suddenly occurred to me: split the process across parallel database clusters.
The source table had already been partitioned by year in an attempt to speed things up, but as each row was being updated in sequence, the time required was still too long.
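For reference, the layout was roughly the following; the partition key and column names here are assumptions, but the one-partition-per-year shape is what matters for the next step:
-- Assumed layout: range partitioning on a date column, one partition per year.
CREATE TABLE partition_table (
    user_id        int,
    document_title text,
    created_at     date
) PARTITION BY RANGE (created_at);

CREATE TABLE partition_table_2015 PARTITION OF partition_table
    FOR VALUES FROM ('2015-01-01') TO ('2016-01-01');

CREATE TABLE partition_table_2016 PARTITION OF partition_table
    FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');

-- ...and so on for each year.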
So I set about using RDS to clone the source database, creating one clone for each year partition.
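Aurora supports fast, copy-on-write clones, which can be driven from the AWS CLI. A rough sketch of creating one clone (the identifiers and instance class are placeholders):
# Clone the source cluster; copy-on-write makes this quick to provision.
aws rds restore-db-cluster-to-point-in-time \
    --source-db-cluster-identifier source-cluster \
    --db-cluster-identifier dbcluster-1 \
    --restore-type copy-on-write \
    --use-latest-restorable-time

# A clone starts with no instances, so add a writer to do the work.
aws rds create-db-instance \
    --db-cluster-identifier dbcluster-1 \
    --db-instance-identifier dbcluster-1-writer \
    --db-instance-class db.r6g.2xlarge \
    --engine aurora-postgresql
With a clone up for each year partition, I then started the UPDATE individually on each cluster using a small bash command: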
$ echo "UPDATE partition_table_2015 SET document_title = get_document_label(user_id) WHERE ..." | psql -h dbcluster_1;
Similarly, on the second DB cluster:
$ echo "UPDATE partition_table_2016 SET document_title = get_document_label(user_id) WHERE ..." | psql -h dbcluster_2;
...and so on.
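In practice this is worth scripting rather than typing out by hand. A rough sketch, assuming the cluster endpoints and partition names follow the pattern above (the WHERE clause is elided here just as it is above):
#!/usr/bin/env bash
# Hypothetical driver: fire off one UPDATE per (year, cluster) pair in parallel.
i=1
for year in 2015 2016 2017; do
  echo "UPDATE partition_table_${year}
        SET document_title = (SELECT document_label FROM get_document_title(user_id))
        WHERE ..." | psql -h "dbcluster_${i}" &
  i=$((i + 1))
done
wait  # block until every cluster has finished its partition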
With each DB cluster having its own resources (and writer instance), the plan was that this would drastically cut down the number of hours required to process the full table.
As each DB cluster finished updating its partition, the partition table was exported via pg_dump
and then reimported back into the primary DB cluster:
# Export the partition from each clone ('mydb' stands in for the real database name):
$ pg_dump -Fc -h dbcluster_1 --table partition_table_2015 -f /tmp/partition_2015.pgdump mydb
$ pg_dump -Fc -h dbcluster_2 --table partition_table_2016 -f /tmp/partition_2016.pgdump mydb
$ pg_dump -Fc -h dbcluster_3 --table partition_table_2017 -f /tmp/partition_2017.pgdump mydb
# Import the exports back into the primary DB cluster:
$ pg_restore -h db_primary -d mydb /tmp/partition_2015.pgdump
$ pg_restore -h db_primary -d mydb /tmp/partition_2016.pgdump
$ pg_restore -h db_primary -d mydb /tmp/partition_2017.pgdump
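One detail worth calling out: each partition table already exists (with its stale data) on the primary, so a plain restore would either error or append to the old copy. The stale partition needs dropping first, or pg_restore can be run with --clean; a minimal sketch for the 2015 partition:
-- On the primary, remove the stale copy so pg_restore can recreate the
-- partition from the dump (alternatively, pass --clean to pg_restore).
DROP TABLE partition_table_2015;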
Whilst this process still took some time, parallelising the work across multiple DB clusters sped things up considerably, and meant that each partition could be independently verified before tearing down its clone and importing the pg_dump file back into the main DB cluster.
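The verification itself doesn't need to be elaborate; the checks below are illustrative rather than the real ones:
-- Every row in the partition should now have a title populated.
SELECT count(*) AS missing_titles
FROM partition_table_2015
WHERE document_title IS NULL;

-- And a handful of rows can be eyeballed against the cache.
SELECT p.user_id, p.document_title, c.document_label
FROM partition_table_2015 p
JOIN cache_table c USING (user_id)
LIMIT 10;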
👋 Thanks for reading - I hope you enjoyed this post. If you find it helpful and want to support further writing and tutorials like this one, please consider supporting my work with a coffee!