
How to prepare a PDF for AI (Notebook/Flashcard). Copying text from a PDF with scientific symbols is not working.


Problem

  1. "explain entropy as a" is copied as "e x p l a i n e n t r o p y a s a". Sometimes a character also seems to be a homoglyph, for example the Latin letter "a" versus the Cyrillic letter "а".

  2. There are random line breaks everywhere.

  3. Scientific symbols are not copied or copied as .

  4. Superscripts and subscripts are especially affected.

  5. The sigma symbol is not copied at all.

  6. Selecting is sometimes hard: selecting a formula selects everything, or other things.

  7. Superscript + and - signs are not copied.

  8. The arrow is not always copied. It seems like some type of DRM: the book uses two different-looking arrows.

  9. Sometimes there are what seem to be handwritten symbols.

  10. I copied a "minus in a circle in superscript" into https://www.soscisurvey.de/tools/view-chars.php and it shows as U+F030, which https://www.compart.com/en/unicode/U+F030 lists as a private-use code point.
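The check done with the online character viewer in item 10 can also be run locally. The following is a minimal sketch using only Python's standard `unicodedata` module; the function names `describe_chars` and `suspicious` are hypothetical helpers made up for this illustration. It lists the code point of each pasted character and flags private-use characters (such as U+F030) and Cyrillic look-alikes (such as "а").

```python
import unicodedata


def describe_chars(text):
    """Return (code point, general category, name) for every character."""
    rows = []
    for ch in text:
        # Private-use and unassigned code points have no name, so pass a default.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((f"U+{ord(ch):04X}", unicodedata.category(ch), name))
    return rows


def suspicious(text):
    """Flag private-use characters and Cyrillic letters in pasted text."""
    flagged = []
    for ch in text:
        if unicodedata.category(ch) == "Co":  # "Co" = private use
            flagged.append((ch, "private use"))
        elif unicodedata.name(ch, "").startswith("CYRILLIC"):
            flagged.append((ch, unicodedata.name(ch)))
    return flagged


if __name__ == "__main__":
    pasted = "a\u0430\uf030"  # Latin a, Cyrillic а, private-use U+F030
    for row in describe_chars(pasted):
        print(row)
    print(suspicious(pasted))
```

A private-use code point has no standard meaning, so what U+F030 looks like depends entirely on the font embedded in the PDF. That would be consistent with the symbol displaying correctly in the PDF but pasting as something unusable.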

Notes/Question

I have two problems:

  1. I am not able to copy text correctly.

  2. I want to prepare the PDF for an LLM.

Situation

  1. The text is copyable without using OCR.

  2. The text is already copyable, but I want to add an OCR layer to it. How?

  3. Make OCR ignore the header, footer, and page numbers.

  4. Example PDF: https://ncert.nic.in/textbook.php?kech1=5-6 (the PDF is free for personal use, but printing it is illegal).
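For item 3, one possible workaround does not configure the OCR engine at all. Instead, it post-processes the text after it has been extracted page by page. The sketch below is an assumption, not something taken from the thread: it treats the first and last non-empty line of each page as a header or footer candidate, and drops a candidate if the same line (with digits ignored, so page numbers match) appears on most pages. The function name `strip_running_lines` and the `min_fraction` threshold are hypothetical.

```python
import re
from collections import Counter


def _key(line):
    """Normalize a line so that 'Page 12' and 'Page 13' compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def strip_running_lines(pages, min_fraction=0.6):
    """Drop first/last lines that repeat on at least min_fraction of pages."""
    counts = Counter()
    for text in pages:
        lines = [ln for ln in text.splitlines() if ln.strip()]
        if lines:
            # A set avoids double counting on single-line pages.
            counts.update({_key(lines[0]), _key(lines[-1])})

    threshold = min_fraction * len(pages)
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for text in pages:
        lines = text.splitlines()
        nonempty = [i for i, ln in enumerate(lines) if ln.strip()]
        edges = {nonempty[0], nonempty[-1]} if nonempty else set()
        kept = [
            ln for i, ln in enumerate(lines)
            if not (i in edges and _key(ln) in repeated)
        ]
        cleaned.append("\n".join(kept))
    return cleaned


if __name__ == "__main__":
    pages = [
        "THERMODYNAMICS\nEntropy is a state function.\n137",
        "THERMODYNAMICS\nLattice enthalpy of an ionic solid.\n138",
        "THERMODYNAMICS\nEnthalpy of atomization.\n139",
    ]
    print(strip_running_lines(pages))
```

Only the first and last lines are considered, so a body line that happens to repeat inside the pages is left alone. A multi-line header would need the candidate window widened.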

What I Found

https://github.com/datalab-to/marker

3. What is the best way to convert the PDF to flashcards?

  1. I tried https://anki-decks.com/app/dashboard/, but it is limited to 25 pages.

  2. It doesn't ask about the important things (it doesn't understand the scientific context well enough to pick up formulas and tables).

  3. Symbols are not working.

Pdf weirdness

Text as pasted into a rich text editor:

Na Cl s Na g Cl g

( ) ;

1

2 ∆bond H = 121 kJ mol–1

Does this have to do with Unicode and the PDF software?

Check what is actually copied using a clipboard viewer. My guess is that the text uses a 2-byte encoding, probably UTF-16, but the font has no ToUnicode entry in its font dictionary, so Acrobat doesn't know how to turn the bytes back into "information". It just gives you the raw bytes, such as 00 65 for the 'e'. With a ToUnicode table, Acrobat would know during text extraction to turn 00 65 back into a plain 'e'; without one, it doesn't know what that stream of bytes represents. That's because PDF isn't limited to fixed or predefined text encodings: the encoding can be whatever the PDF file defines. If you want the text to be extractable, you have to use a standard encoding or provide a ToUnicode table that turns the bytes back into information.
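The guess above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not how Acrobat or any PDF library is implemented: `extract_text` is a hypothetical function, and the `to_unicode` dictionary stands in for a font's ToUnicode table. Without the table, every byte is passed through as its own character, so the 00 bytes show up as gaps between letters; with the table, each 2-byte code maps to one character.

```python
def extract_text(raw, to_unicode=None):
    """Toy text extraction from a string of 2-byte character codes.

    raw        -- bytes of the text string, two bytes per character code
    to_unicode -- hypothetical stand-in for a ToUnicode table: {code: str}
    """
    if to_unicode is None:
        # No mapping available: pass each byte through as a character.
        # NUL bytes are shown as spaces to make them visible.
        return raw.decode("latin-1").replace("\x00", " ")

    codes = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    # Codes missing from the table become U+FFFD (replacement character).
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


if __name__ == "__main__":
    raw = "explain".encode("utf-16-be")      # 00 65 00 78 00 70 ...
    table = {ord(ch): ch for ch in "explain"}

    print(repr(extract_text(raw)))           # bytes passed through unmapped
    print(repr(extract_text(raw, table)))    # codes mapped through the table
```

In this model the codes happen to equal UTF-16 values, matching the guess in the reply. In a real PDF the codes can be arbitrary font-specific numbers, which is why the table is needed to recover text at all.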

Edited by HbWhi5F
added flashcard generator program

How to prepare pdf for AI

This is difficult to understand. Why would you prepare a PDF for "AI" (probably an LLM?) rather than the other way around? People usually want an LLM to generate a PDF for them.

Besides, "LLM" on its own doesn't mean anything, or rather it means too much. You should talk about preparing the PDF for ChatGPT, DeepSeek, etc. The answer may differ for each one depending on its capabilities and version, and on whether you have a paid or a free plan. In the free, unregistered ChatGPT, for example, you can't upload your files, so it won't even do OCR for you.

Another issue is that PDF is a very complex format. A PDF can contain embedded images from a scanner that look as if they were generated by an engine but cannot be easily parsed.

That is, the same parser can work fine with one PDF and fail on another.
