Closed 6 years ago.
Can anyone recommend a library/API for extracting the text and images from a PDF? We need to get at text contained in predefined regions of the document, so the API has to give us positional information for each element on the page.
We would like that data to be output in XML or JSON format. We're currently looking at PdfTextStream, which seems pretty good, but we would like to hear other people's experiences and suggestions.
Are there alternatives (commercial or free) for extracting text from a PDF programmatically?
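To make the requirement concrete, here is a minimal sketch of the post-processing we have in mind, independent of whichever library ends up doing the extraction. It assumes the extractor can hand back each text element with a bounding box in PDF points (origin at the bottom-left of the page). The `TextElement` class, the helper functions, the sample data and the region coordinates are all hypothetical and not taken from any particular library.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TextElement:
    """Hypothetical record for one extracted text run and its bounding box."""
    text: str
    x0: float  # left edge, in PDF points
    y0: float  # bottom edge
    x1: float  # right edge
    y1: float  # top edge


def elements_in_region(elements, region):
    """Return the elements whose bounding box lies entirely inside region.

    region is an (x0, y0, x1, y1) tuple in the same coordinate space.
    """
    rx0, ry0, rx1, ry1 = region
    return [
        e for e in elements
        if e.x0 >= rx0 and e.y0 >= ry0 and e.x1 <= rx1 and e.y1 <= ry1
    ]


def region_to_json(elements, region):
    """Serialize the elements found in region, with their positions, as JSON."""
    hits = elements_in_region(elements, region)
    payload = {"region": list(region), "elements": [asdict(e) for e in hits]}
    return json.dumps(payload, indent=2)


# Made-up sample data standing in for real extractor output.
page = [
    TextElement("Invoice No. 1042", 50, 700, 180, 715),
    TextElement("Total: 99.00", 400, 100, 480, 115),
]

# Assumed region: the top strip of a US Letter page (612 x 792 points).
HEADER_REGION = (0, 650, 612, 792)

print(region_to_json(page, HEADER_REGION))
```

Any library that exposes per-element coordinates could feed a filter like this. The containment test is strict on purpose, so an element straddling the region boundary is left out; an overlap test would be the alternative if partial matches should count.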