[RAFT] The chunking function doesn't separate PDF document pages in the chunks #585

Open
cedricvidal opened this issue Aug 15, 2024 · 0 comments


cedricvidal commented Aug 15, 2024

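In the get_chunks function, the text extracted from each PDF page is concatenated with no separator between pages. Chunks that span a page boundary therefore end up missing a space, a period or a new line at the point where one page ends and the next begins.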

gorilla/raft/raft.py

Lines 90 to 92 in 2fc82a9

for page_num in range(num_pages):
    page = reader.pages[page_num]
    text += page.extract_text()

It is unclear how to handle this in a generic way: depending on the context, the page boundary might require a space, a new line or a period.

This may impact the quality of the generated dataset.
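
One possible direction, as a minimal sketch rather than a proposed fix: collect the text of each page separately and pick the separator from how the previous page ends. The helper name `join_pages` and the punctuation heuristic are assumptions made for illustration; they are not part of raft.py, and the heuristic will guess wrong in some cases (headings, tables and list items that end without punctuation, for example).

```python
from typing import Iterable, Optional

# Assumption: a page ending in one of these characters most likely ends a sentence.
SENTENCE_END = (".", "!", "?", ":", ";")


def join_pages(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text so that words never fuse across a page boundary.

    Hypothetical helper: inserts a new line after a page that ends a sentence,
    and a single space otherwise (the sentence probably continues).
    """
    joined = ""
    for raw in page_texts:
        page_text = (raw or "").strip()
        if not page_text:
            continue  # skip pages with no extractable text
        if not joined:
            joined = page_text
        elif joined.endswith(SENTENCE_END):
            joined += "\n" + page_text
        else:
            joined += " " + page_text
    return joined
```

In get_chunks this could replace the accumulation loop with something like `text = join_pages(page.extract_text() for page in reader.pages)`, assuming `reader` is the same PDF reader object used in the snippet above.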
