Can anyone recommend a library/API for extracting the text and images from a PDF? We need to get at text contained in pre-known regions of the document, so the API must give us positional information for each element on the page.
We would like that data to be output in XML or JSON format. We’re currently looking at PdfTextStream, which seems pretty good, but we would like to hear other people’s experiences and suggestions.
Are there alternatives (commercial or free) for extracting text from a PDF programmatically?
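To make the requirement concrete, here is a minimal sketch of the kind of output we are after. It assumes the free pdfminer.six package as one possible extractor (not PdfTextStream), and the helper names `extract_elements`, `in_region`, and `region_text_as_json`, as well as the region format, are hypothetical names made up for illustration. It covers text blocks only, not images.

```python
import json


def extract_elements(pdf_path):
    """Yield text blocks with bounding boxes via pdfminer.six (assumed installed)."""
    # Imported lazily so the region helpers below work without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page in enumerate(extract_pages(pdf_path), start=1):
        for element in page:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox  # PDF points, origin at bottom-left
                yield {
                    "page": page_number,
                    "bbox": [x0, y0, x1, y1],
                    "text": element.get_text().strip(),
                }


def in_region(bbox, region):
    """Return True if bbox lies entirely inside region; both are (x0, y0, x1, y1)."""
    return (bbox[0] >= region[0] and bbox[1] >= region[1]
            and bbox[2] <= region[2] and bbox[3] <= region[3])


def region_text_as_json(elements, regions):
    """Group element text by named region and serialize the result as JSON.

    regions maps a name to (page_number, (x0, y0, x1, y1)); this layout is
    an assumption for the sketch, not a format defined by any library.
    """
    result = {name: [] for name in regions}
    for el in elements:
        for name, (page, region) in regions.items():
            if el["page"] == page and in_region(el["bbox"], region):
                result[name].append(el["text"])
    return json.dumps(result, indent=2)
```

With a real file this would be called as `region_text_as_json(extract_elements("some.pdf"), {"header": (1, (0, 650, 300, 800))})`, where the file name and coordinates are placeholders. Requiring full containment is a deliberate simplification; an overlap test may suit loosely laid-out documents better.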