Why jump through these hoops instead of:
1. Export as PNG (or whatever you prefer)
2. Add black rectangle/redact, and save again as raster image, preferably in a lossless way
3. Export as PDF, if you need that. Make sure that you've checked and/or erased all metadata from step 1 that is easily found as text (hidden layers or text in metadata, for example). For common raster formats such as PNG or JPG, this should amount to briefly checking metadata and/or strings output.
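For the PNG case, the metadata check in step 3 can be done without extra tools. Here is a minimal sketch in Python (stdlib only) that walks the PNG chunk list and flags the chunk types that usually carry text or metadata. The function names and the tiny demo image are made up for illustration, and the list of "interesting" chunk types is an assumption, not an exhaustive one:

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Ancillary chunk types that commonly carry text or metadata (assumed list).
METADATA_CHUNKS = {"tEXt", "zTXt", "iTXt", "eXIf", "tIME"}


def list_png_chunks(data: bytes) -> list[tuple[str, int]]:
    """Return (chunk_type, payload_length) for every chunk in a PNG."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    chunks = []
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        chunks.append((ctype.decode("latin-1"), length))
        pos += 8 + length + 4  # chunk header + payload + CRC
        if ctype == b"IEND":
            break
    return chunks


def metadata_chunks(data: bytes) -> list[str]:
    """Names of chunks that may carry text or metadata."""
    return [name for name, _ in list_png_chunks(data) if name in METADATA_CHUNKS]


def _chunk(ctype: bytes, payload: bytes) -> bytes:
    """Build one PNG chunk (helper for the demo image only)."""
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)


def make_demo_png(with_text: bool) -> bytes:
    """Build a 1x1 grayscale PNG, optionally with a tEXt chunk."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one black pixel
    parts = [PNG_SIGNATURE, _chunk(b"IHDR", ihdr)]
    if with_text:
        parts.append(_chunk(b"tEXt", b"Author\x00Jane Doe"))
    parts += [_chunk(b"IDAT", idat), _chunk(b"IEND", b"")]
    return b"".join(parts)
```

On a real file you would call `metadata_chunks(open("page.png", "rb").read())` and want an empty list back. This only inspects chunk headers; it says nothing about what is hidden in the pixel data itself.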
Is there anything else that a "PDF redactor" should do?
And are we sure that this one does all the steps?
If you like to be paranoid: a universal removal tool for steganographically stored info is theoretically impossible.
Appreciate the feedback. The steps you listed are essentially what the site is doing. Upload a PDF, add the black boxes, it gets converted to PNG and back to a new PDF. The value of this tool is just to streamline that process to make it quicker and easier.
The point about metadata is a good one. I checked a test file and you can't see any metadata from the original PDF; you only see basic info about the new PDF file and that it was produced by pdf-lib.
There definitely could be other things that a redactor should do, but for most use cases I think steganographically stored info lives outside of the threat model.
edit: ran strings on the output file, nothing but PDF structure and compressed image data, no original text content - thanks for the suggestion.
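For anyone who wants to repeat that check in a script, here is a rough Python equivalent of running `strings` and searching for known sensitive terms. It is a sketch under assumptions: `printable_strings` and `find_leaks` are illustrative names, it only sees ASCII runs in the raw bytes, and it will not see text inside compressed streams or UTF-16 encoded strings.

```python
import re


def printable_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII at least min_len long, like `strings`."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def find_leaks(data: bytes, sensitive_terms: list[str]) -> list[str]:
    """Return the sensitive terms that still appear in the file's ASCII runs."""
    haystack = "\n".join(printable_strings(data)).lower()
    return [term for term in sensitive_terms if term.lower() in haystack]
```

Usage would look like `find_leaks(open("redacted.pdf", "rb").read(), ["123-45-6789"])`, where the file name and the term are placeholders. An empty result is what you hope for, but it is only evidence, not proof, that nothing leaked.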
curious whether metadata survives the PNG roundtrip. things like original creation dates, software used, or embedded thumbnails can still leak info even in rasterized PDFs. might be worth adding a strip step if you aren't already doing it
Good point, just pushed a fix. Title, author, subject, keywords, producer, creator, creation date, and modification date are now explicitly stripped from the output file metadata.
It seemed that these were already removed when the PDF was rasterized, but now they're explicitly being removed.
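One way to double-check this from the outside is to look for the standard PDF Info dictionary keys in the output file. The sketch below is not the tool's code. It is a crude byte-level scan in Python that also inflates Flate-compressed streams, since the Info dictionary can end up inside a compressed object stream. The function names are illustrative, the regex handles only simple string values, and a key like `/Title` can also appear in outlines, so treat hits as something to inspect rather than as a verdict.

```python
import re
import zlib

# Standard Info dictionary keys from the PDF specification.
INFO_KEYS = [b"/Title", b"/Author", b"/Subject", b"/Keywords",
             b"/Producer", b"/Creator", b"/CreationDate", b"/ModDate"]

STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A literal string "(...)" or a hex string "<...>" following the key.
VALUE_RE = rb"\s*(\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>)"


def inflate_streams(pdf: bytes) -> list[bytes]:
    """Best-effort inflate of every stream; skip ones that are not Flate."""
    out = []
    for m in STREAM_RE.finditer(pdf):
        try:
            # decompressobj tolerates trailing end-of-line bytes.
            out.append(zlib.decompressobj().decompress(m.group(1)))
        except zlib.error:
            pass  # not zlib data (for example raw or JPEG image data)
    return out


def info_entries(pdf: bytes) -> dict[str, bytes]:
    """Map each Info key found to the raw value token that follows it."""
    found: dict[str, bytes] = {}
    for blob in [pdf] + inflate_streams(pdf):
        for key in INFO_KEYS:
            m = re.search(re.escape(key) + VALUE_RE, blob)
            if m and key.decode() not in found:
                found[key.decode()] = m.group(1)
    return found
```

If the strip step works as described, whatever `info_entries` returns for the output file should contain nothing carried over from the original document, only empty values or values describing the new file.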
Is this open source?
I have a friend who works in an air-gapped environment, and this would work for him.
Can't use this if it isn't open source.
Just open sourced it: github.com/mr-guac/redactpdf
For your friend's air gapped environment, the file works offline after the libraries cache on first load, but it does pull PDF.js and pdf-lib from CDN so a one-time internet connection is needed.
To run it fully offline you'd need to download those two libraries separately, transfer them to the air gapped machine, and swap the CDN links in the HTML to point to the local files instead.
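The link swap can be scripted. Below is a minimal Python sketch, assuming the libraries are loaded through ordinary `<script src="https://...">` tags; I have not checked how the repo's HTML is actually laid out. The `vendor` directory name and `localize_scripts` are made up for illustration. It rewrites each remote script URL to a local path and returns the list of URLs you would need to download and carry over.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Matches <script ... src="http(s)://..."> with single or double quotes.
SCRIPT_SRC_RE = re.compile(
    r'(<script\b[^>]*\bsrc=["\'])(https?://[^"\']+)(["\'])',
    re.IGNORECASE,
)


def localize_scripts(html: str, local_dir: str = "vendor") -> tuple[str, list[str]]:
    """Point remote <script src> URLs at local files.

    Returns the rewritten HTML and the remote URLs that were replaced,
    so they can be downloaded into local_dir on a connected machine.
    """
    remote_urls: list[str] = []

    def swap(match: re.Match) -> str:
        url = match.group(2)
        remote_urls.append(url)
        filename = PurePosixPath(urlparse(url).path).name
        return f"{match.group(1)}{local_dir}/{filename}{match.group(3)}"

    return SCRIPT_SRC_RE.sub(swap, html), remote_urls
```

This only touches script tags. Any URL set inside inline JavaScript (for example a worker path) would still need to be changed by hand, so it is worth searching the rewritten file for leftover `https://` references before moving it across.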