ricecooker.utils.pdf.PDFParser#get_toc reimplemented to use pdftk #324

nucleogenesis · 2021-03-29T18:27:15Z

Fixes #312

WIP - Need to make a test that fails the old get_toc implementation (temporarily renamed to get_old_toc) but passes the new one to be sure that I'm addressing the issue noted about PDFs created with wkhtmltopdf.

Currently:

Newly implemented get_toc passes existing test suites.
get_toc uses NamedTemporaryFile and hopefully in a cross-OS compatible way (I only accessed it using the .name property once and close the file at the end anyway).
I didn't see the need to update any other methods since the others mentioned by @rtibbles in PDF Files created with wkhtmltopdf cannot have their outline parsed #312 call get_toc for their data.

rtibbles

One update to start with - my suggestion in the issue was just to bite the bullet and use pdftk for all of our PDF operations, so we could drop reliance on the unmaintained Python library.

rtibbles · 2021-04-05T22:09:44Z

ricecooker/utils/pdf.py

    def get_toc(self, subchapters=False):
+        temp_file = NamedTemporaryFile()


Need to use mkstemp here - on windows trying to access a NamedTemporaryFile by referencing its name will give you a permissions error.

Also then need to explicitly close the file handle returned by mkstemp to ensure proper cleanup.

See here for an example:

ricecooker/ricecooker/classes/files.py

Line 776 in 8cbdf59

fdin, temp_in_file_name = tempfile.mkstemp()

One update to start with - my suggestion in the issue was just to bite the bullet and use pdftk for all of our PDF operations, so we could drop reliance on the unmaintained Python library.

There is one issue here where we'll break the PDFParser API which I mentioned in my comment on the linked issue: #312 - the PDFParser.pdf field is (in code) a public property. It isn't mentioned in the docs though. I'm okay with moving forward this way but feel it just needs to be noted that some peoples' chefs might break if they've managed to rely on that property at all - which they might because it is a PyPDF2 instance.

Will continue on with fully wrapping pdftk - but as we are effectively not doing anything to improve the software, I also suggested we consider just catching the error and informing the user of the limitations of PDFParser - which is that we cannot help them if their PDF doesn't have the proper metadata. This would be equally robust a solution if it can be worked out and ostensibly simpler.

We have to do something like this anyway if we use pdftk.

Happy to go either way just wanted to hear your thoughts on the above.

Also - its worth noting that the change I made is basically removing the meat of PyPDF2.

What remains is some clean-up and documentation updates to finalize the removal of PyPDF2. Also - I need to catch and raise a meaningful error message when we get a PDF without an outline.

Yep - I think catching errors and giving meaningful error messages is an improvement :)

rtibbles · 2021-07-26T15:06:11Z

I also recently came across this: https://github.com/jorisschellekens/borb - not sure if it would help or not.

- Used subprocess to call pdftk - The output is written to and then read from a tempfile - this is done to avoid issues where CLI tools may at times output unexpected lines that could cause issues for parsing. - Used NamedTemporaryFile and only accessed the file with the `NamedTemporaryFile.name` property once - should be compatible across all OS implementations. Also the file is closed at the end of the process. - This is only done such that it passes the existing test suite - so it assumes that the suite is robust enough at this point to cover some edge cases

This reverts commit 5c6234c.

nucleogenesis · 2021-08-18T14:08:18Z

Trying borb will probably be necessary. I think I know what the issue is - the latest Mac OS (10.15ish) doesn't run 32-bit applications and pdftk only targets back to 10.9 so I think it's probably running into that issue? I can't say for sure but it jives with the "bad CPU type" error we're seeing.

I can give borb a look as an alternative for sure - as long as it breaks up the chapters similarly to PyPDF2 and pdftk it ought to be a relatively quick refactor

nucleogenesis · 2021-08-18T18:03:13Z

re: borb - I played with it a bit this AM and here are some notes for when returning here.

load a file like:

with open('path/to/file', 'rb') as f:
    doc = borb.pdf.pdf.PDF.loads(f)
    outlines = doc["XRef"]["Trailer"]["Root"]["Outlines"] # is where the outlines live
    
    from borb.io.read.types import Name, String, Dictionary
    
    # when you look into outlines, you'll see a dictionary of objects of thoes Name, String, Dictionary types
    # which means you have to look at the source to know how to get the values from them like String you call
    # __str__() on it... it's a mess. This should be enough of a head start upon returning.

nucleogenesis · 2021-08-23T18:46:29Z

I played a bit more with borb and it turns out that the metadata it does pull from a PDF is currently incomplete - only extracting a subset of the outlines metadata that both PyPDF2 and pdftk extracted successfully.

nucleogenesis · 2021-09-01T19:45:11Z

cpdf is free for non-commercial use and does everything we want.

It was a pretty quick shift - but now tox is mad about subprocesses. I wanted to capture stderr during dev so I used subprocess.run() and didn't write to a tmp file like before. Not sure if this is the cause, but the error is coming up in subprocess.Popen - so despite finding a tool that works for our needs very well, back at square one with failing tests on the binaries.

Will come back to this again at some point.

nucleogenesis added the needs review label Mar 29, 2021

nucleogenesis requested a review from rtibbles March 29, 2021 18:27

nucleogenesis added the WIP label Mar 29, 2021

nucleogenesis mentioned this pull request Mar 30, 2021

PDF Files created with wkhtmltopdf cannot have their outline parsed #312

Open

rtibbles reviewed Apr 5, 2021

View reviewed changes

nucleogenesis added 8 commits August 11, 2021 10:38

Trying to ensure pdftk is installed in all testing envs

3632892

misspelling

55784fc

ps command for extracting exe for pdftk in gh workflow

b6d1e5f

getting the right pdftk

6cf86e0

getting the right pdftk this time

eb6b259

cleaned up

1c874d3

Fix spelling mistake in pythontest.yml

5c6234c

nucleogenesis force-pushed the pdftk branch from 8609f20 to 5c6234c Compare August 11, 2021 17:38

nucleogenesis added 2 commits August 18, 2021 06:56

trying something to fix the mac-os testing

cd75d1d

Revert "Fix spelling mistake in pythontest.yml"

ab61731

This reverts commit 5c6234c.

nucleogenesis added 2 commits September 1, 2021 12:22

use cpdf instead of pdftk

cc4acd9

try to get windows downloaded correctly

7c6fdea

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ricecooker.utils.pdf.PDFParser#get_toc reimplemented to use pdftk #324

ricecooker.utils.pdf.PDFParser#get_toc reimplemented to use pdftk #324

nucleogenesis commented Mar 29, 2021

rtibbles left a comment

rtibbles Apr 5, 2021

nucleogenesis Apr 6, 2021

nucleogenesis Apr 6, 2021

rtibbles Apr 6, 2021

rtibbles commented Jul 26, 2021

nucleogenesis commented Aug 18, 2021 •

edited

Loading

nucleogenesis commented Aug 18, 2021

nucleogenesis commented Aug 23, 2021

nucleogenesis commented Sep 1, 2021

		def get_toc(self, subchapters=False):
		temp_file = NamedTemporaryFile()

ricecooker.utils.pdf.PDFParser#get_toc reimplemented to use pdftk #324

Are you sure you want to change the base?

ricecooker.utils.pdf.PDFParser#get_toc reimplemented to use pdftk #324

Conversation

nucleogenesis commented Mar 29, 2021

rtibbles left a comment

Choose a reason for hiding this comment

rtibbles Apr 5, 2021

Choose a reason for hiding this comment

nucleogenesis Apr 6, 2021

Choose a reason for hiding this comment

nucleogenesis Apr 6, 2021

Choose a reason for hiding this comment

rtibbles Apr 6, 2021

Choose a reason for hiding this comment

rtibbles commented Jul 26, 2021

nucleogenesis commented Aug 18, 2021 • edited Loading

nucleogenesis commented Aug 18, 2021

nucleogenesis commented Aug 23, 2021

nucleogenesis commented Sep 1, 2021

nucleogenesis commented Aug 18, 2021 •

edited

Loading