Updating Millions of Rows in PostgreSQL with Parallel Database Clusters

For a large database transformation and migration project back in November, the requirement arose to populate a column across 100+ million records with values looked up from multiple other tables in different schemas.

I wrote a PL/pgSQL function to perform the lookup and, as each value would be reused across a batch of 100 or so rows, to insert the result of the lookup into a cache table.

The function looked something like this:

-- Example function, not the real code...
CREATE FUNCTION get_document_label(userid int)
  RETURNS TABLE(user_id int, document_label text)
  LANGUAGE plpgsql
AS $$
DECLARE
  cachecount bigint;
  result record;
BEGIN
  -- Check cache
  SELECT COUNT(*) INTO cachecount
    FROM cache_table c
    WHERE c.user_id = userid;

  IF cachecount > 0 THEN
    -- Result has been cached
    RETURN QUERY
      SELECT c.user_id, c.document_label
        FROM cache_table c
        WHERE c.user_id = userid
        LIMIT 1;
    RETURN;
  END IF;

  -- Perform the lookup.
  -- In the real code, there are actually several branches for looking up
  -- from different sources, but for this demo, I've simplified:
  SELECT * INTO result
    FROM documents d
    WHERE d.user_id = userid
      AND d.project_id IN (
        SELECT p.id
          FROM projects p
          WHERE p.archived IS NOT TRUE
      )
    LIMIT 1;

  -- Cache the result
  INSERT INTO cache_table (user_id, document_label)
    VALUES (userid, result.document_label);

  -- Return the result
  RETURN QUERY SELECT userid, result.document_label;
END;
$$;

However, with over 100 million rows to process, the resulting query was projected to take days - if not weeks - of processing time, even though it was running on a fairly large Aurora PostgreSQL cluster.

Whilst on a walk and not thinking about this problem, the idea suddenly occurred to me of splitting the process across parallel database clusters.

The source table had already been partitioned by year in an attempt to speed things up, but as each row was being updated in sequence, the time required was still too long.

So I set about using RDS to clone the source database - one for each year partition - and then starting the UPDATE individually on each cluster using a small bash command:

$ echo "UPDATE partition_table_2015 SET document_title = (SELECT document_label FROM get_document_label(user_id)) WHERE ..." | psql -h dbcluster_1

Similarly, on the second DB cluster:

$ echo "UPDATE partition_table_2016 SET document_title = (SELECT document_label FROM get_document_label(user_id)) WHERE ..." | psql -h dbcluster_2

...and so on.
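With a handful of year partitions this is manageable by hand, but it can also be scripted. Here's a minimal bash sketch; the cluster hostnames, years, and dry-run printing are illustrative assumptions, and a real run would launch each piped psql command in the background and wait for all of them:

```shell
# Build one UPDATE command per (year, cluster) pair.
# Dry run: commands are collected and printed, not executed.
years=(2015 2016 2017)
cmds=()
i=1
for year in "${years[@]}"; do
  sql="UPDATE partition_table_${year} SET document_title = (SELECT document_label FROM get_document_label(user_id)) WHERE ..."
  cmds+=("echo \"${sql}\" | psql -h dbcluster_${i}")
  i=$((i + 1))
done

# A real run would execute each in the background, then wait for all:
#   for cmd in "${cmds[@]}"; do eval "$cmd" & done; wait
printf '%s\n' "${cmds[@]}"
```

The backgrounding-plus-wait pattern is what actually buys the parallelism: each cluster chews through its own partition at the same time.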

With each DB cluster having its own resources (and writer instance), the plan was that this would drastically cut down the number of hours required to process the full table.

As each DB cluster finished updating its partition, the partition table was exported via pg_dump and then imported back into the primary DB cluster:

# Export the partition from each individual DB cluster
# (mydb stands in for the actual database name):
$ pg_dump -Fc -h dbcluster_1 --table=partition_table_2015 -f /tmp/partition_2015.pgdump mydb
$ pg_dump -Fc -h dbcluster_2 --table=partition_table_2016 -f /tmp/partition_2016.pgdump mydb
$ pg_dump -Fc -h dbcluster_3 --table=partition_table_2017 -f /tmp/partition_2017.pgdump mydb

# Import all the exports back into the primary DB cluster
# (--clean drops each partition table before recreating it):
$ pg_restore --clean -h db_primary -d mydb /tmp/partition_2015.pgdump
$ pg_restore --clean -h db_primary -d mydb /tmp/partition_2016.pgdump
$ pg_restore --clean -h db_primary -d mydb /tmp/partition_2017.pgdump
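Since the export/import pairs follow the same pattern per year, they too can be generated in a loop. A rough sketch, where the hostnames, years, paths, and the mydb database name are all assumptions, and the commands are only printed rather than executed:

```shell
# Generate matching pg_dump / pg_restore commands for each partition.
# Printed rather than run, since the clusters here are hypothetical.
years=(2015 2016 2017)
dump_cmds=()
restore_cmds=()
i=1
for year in "${years[@]}"; do
  dump_cmds+=("pg_dump -Fc -h dbcluster_${i} --table=partition_table_${year} -f /tmp/partition_${year}.pgdump mydb")
  restore_cmds+=("pg_restore --clean -h db_primary -d mydb /tmp/partition_${year}.pgdump")
  i=$((i + 1))
done
printf '%s\n' "${dump_cmds[@]}" "${restore_cmds[@]}"
```

Unlike the UPDATEs, the restores all target the same primary cluster, so it's safer to run them sequentially rather than in parallel.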

Whilst this process still took some time, parallelising the work across multiple DB clusters definitely sped things up, and meant that each partition could be independently verified before removing the cluster and importing the pgdump file back into the main DB cluster.
