Fixing corrupted data in PostgreSQL databases

Print Friendly, PDF & Email

Recently we noticed that the nightly backups for the BOS database were failing with the following error:

pg_dump: ERROR:  missing chunk number 0 for toast value 329523104
pg_dump: SQL command to dump the contents of table “rt_sessions” failed: PQendcopy() failed.
pg_dump: Error message from server: ERROR:  missing chunk number 0 for toast value 329523104
pg_dump: The command was: COPY public.rt_sessions (sessionid, surveyid, timedate, remoteipnumber, pausedata, status, manifestitemid, username, sessionvalue) TO stdout;

Essentially this means there’s corrupted data in one or more fields within the rt_sessions table. The database is vacuumed daily so that wasn’t fixing the problem. The easiest way to fix this is to find the affected data and remove/update it.

The affected table had about 3.5 million rows so finding the corrupted data is best done by repeated, targeted SELECT statements. The toast value above doesn’t help find the row AFAICT. So, you start off by searching half the data:

SELECT * FROM rt_sessions LIMIT 1750000;

If the above doesn’t produce the error:

ERROR: missing chunk number 0 for toast value 329523104

then you know the corrupted data is in the other half.

This can be confirmed by doing:

SELECT * FROM rt_sessions LIMIT 1750000 OFFSET 1750000;

and seeing if it errors.

With each SELECT you narrow the range of rows you’re searching across, halve it each time:

SELECT * FROM rt_sessions LIMIT 1750000 OFFSET 1750000;
ERROR: missing chunk number 0 for toast value 329523104
SELECT * FROM rt_sessions LIMIT 875000 OFFSET 1750000;
ERROR: missing chunk number 0 for toast value 329523104
SELECT * FROM rt_sessions LIMIT 437500 OFFSET 1750000;
ERROR: missing chunk number 0 for toast value 329523104
SELECT * FROM rt_sessions LIMIT 218750 OFFSET 1750000;
ERROR: missing chunk number 0 for toast value 329523104

and so on.

Once you’ve found the affected row do repeated SELECTs to determine the affected field:

SELECT surveyid FROM rt_sessions WHERE sessionid=2554424;

{gives data}

SELECT surveyid, timedate FROM rt_sessions WHERE sessionid=2554424;

{gives data}

SELECT surveyid, timedate, sessionvalue FROM rt_sessions WHERE sessionid=2554424;
ERROR: missing chunk number 0 for toast value 329523104

So you know the corrupt data is in ‘sessionvalue’ and can fix as appropriate. For example:

UPDATE rt_sessions SET sessionvalue=” WHERE sessionid=2554424;

Check the data has been fixed and/or there are no more corrupted fields by checking the below returns data and not an error:

SELECT * FROM rt_sessions;

If you get more errors then repeat the process to fix them too.

Two important notes

  1. the corruption means the SELECT should fail BEFORE printing results to the screen. if the SELECT starts producing data then the corrupted data isn’t within the set of data being returned, Ctrl+C and edit the SELECT accordingly.
  2. You have to SELECT the affected field for the SELECT to fail. If you do a specific SELECT on one field and that field isn’t corrupted, even if data in the same row is, then you won’t get an error.

State of backups with corrupted data:

If you’ve got corrupted data in a specific table then the pg_dump files can still be used to restore a database BUT the affected table will have NO DATA.

Forensics…

In this case it seems like the corrupted data was caused by clients trying to write to the table during a short period of instability in the box running the database.


About this entry