Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ricecooker.utils.pdf.PDFParser#get_toc reimplemented to use pdftk #324

Open
wants to merge 12 commits into
base: develop
Choose a base branch
from

Conversation

nucleogenesis
Copy link
Member

Fixes #312

WIP - Need to make a test that fails the old get_toc implementation (temporarily renamed to get_old_toc) but passes the new one to be sure that I'm addressing the issue noted about PDFs created with wkhtmltopdf.

Currently:

  • Newly implemented get_toc passes existing test suites.
  • get_toc uses NamedTemporaryFile and hopefully in a cross-OS compatible way (I only accessed it using the .name property once and close the file at the end anyway).
  • I didn't see the need to update any other methods since the others mentioned by @rtibbles in PDF Files created with wkhtmltopdf cannot have their outline parsed #312 call get_toc for their data.

Copy link
Member

@rtibbles rtibbles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One update to start with - my suggestion in the issue was just to bite the bullet and use pdftk for all of our PDF operations, so we could drop reliance on the unmaintained Python library.

def get_toc(self, subchapters=False):
temp_file = NamedTemporaryFile()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to use mkstemp here - on windows trying to access a NamedTemporaryFile by referencing its name will give you a permissions error.

Also then need to explicitly close the file handle returned by mkstemp to ensure proper cleanup.

See here for an example:

fdin, temp_in_file_name = tempfile.mkstemp()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One update to start with - my suggestion in the issue was just to bite the bullet and use pdftk for all of our PDF operations, so we could drop reliance on the unmaintained Python library.

There is one issue here where we'll break the PDFParser API which I mentioned in my comment on the linked issue: #312 - the PDFParser.pdf field is (in code) a public property. It isn't mentioned in the docs though. I'm okay with moving forward this way but feel it just needs to be noted that some peoples' chefs might break if they've managed to rely on that property at all - which they might because it is a PyPDF2 instance.

Will continue on with fully wrapping pdftk - but as we are effectively not doing anything to improve the software, I also suggested we consider just catching the error and informing the user of the limitations of PDFParser - which is that we cannot help them if their PDF doesn't have the proper metadata. This would be equally robust a solution if it can be worked out and ostensibly simpler.

We have to do something like this anyway if we use pdftk.

Happy to go either way just wanted to hear your thoughts on the above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also - its worth noting that the change I made is basically removing the meat of PyPDF2.

What remains is some clean-up and documentation updates to finalize the removal of PyPDF2. Also - I need to catch and raise a meaningful error message when we get a PDF without an outline.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep - I think catching errors and giving meaningful error messages is an improvement :)

@rtibbles
Copy link
Member

I also recently came across this: https://github.com/jorisschellekens/borb - not sure if it would help or not.

- Used subprocess to call pdftk
- The output is written to and then read from a tempfile - this is done to avoid issues where CLI tools may at times output unexpected lines that could cause issues for parsing. 
- Used NamedTemporaryFile and only accessed the file with the `NamedTemporaryFile.name` property once - should be compatible across all OS implementations. Also the file is closed at the end of the process.
- This is only done such that it passes the existing test suite - so it assumes that the suite is robust enough at this point to cover some edge cases
@nucleogenesis
Copy link
Member Author

nucleogenesis commented Aug 18, 2021

Trying borb will probably be necessary. I think I know what the issue is - the latest Mac OS (10.15ish) doesn't run 32-bit applications and pdftk only targets back to 10.9 so I think it's probably running into that issue? I can't say for sure but it jives with the "bad CPU type" error we're seeing.

I can give borb a look as an alternative for sure - as long as it breaks up the chapters similarly to PyPDF2 and pdftk it ought to be a relatively quick refactor

@nucleogenesis
Copy link
Member Author

re: borb - I played with it a bit this AM and here are some notes for when returning here.

load a file like:

with open('path/to/file', 'rb') as f:
    doc = borb.pdf.pdf.PDF.loads(f)
    outlines = doc["XRef"]["Trailer"]["Root"]["Outlines"] # is where the outlines live
    
    from borb.io.read.types import Name, String, Dictionary
    
    # when you look into outlines, you'll see a dictionary of objects of thoes Name, String, Dictionary types
    # which means you have to look at the source to know how to get the values from them like String you call
    # __str__() on it... it's a mess. This should be enough of a head start upon returning.

@nucleogenesis
Copy link
Member Author

I played a bit more with borb and it turns out that the metadata it does pull from a PDF is currently incomplete - only extracting a subset of the outlines metadata that both PyPDF2 and pdftk extracted successfully.

@nucleogenesis
Copy link
Member Author

cpdf is free for non-commercial use and does everything we want.

It was a pretty quick shift - but now tox is mad about subprocesses. I wanted to capture stderr during dev so I used subprocess.run() and didn't write to a tmp file like before. Not sure if this is the cause, but the error is coming up in subprocess.Popen - so despite finding a tool that works for our needs very well, back at square one with failing tests on the binaries.

Will come back to this again at some point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PDF Files created with wkhtmltopdf cannot have their outline parsed
2 participants