Copy Text from PDF without Line Breaks & More Word Tricks

I need to get thousands of snippets of text from PDFs into a spreadsheet. They are short, seldom more than a few rows, but each line break creates a new cell, and I have to repair that manually, which costs a lot of time.

Because I have so many of them, the "paste into Word and do a find-and-replace" workaround is just too time-consuming for me. Is there a way to have the line breaks disappear on copy? Maybe there is a viewer which offers a special copy mode for this, or has a plugin? The documents are scientific articles.

The text arrangement is quite linear. You can assume that the text I'm copying is not inside a table or a float, and not rotated or anything.

If such a thing happens, I think I'll deal with it manually. The text is frequently set in two columns, but I have no trouble marking just the text I need from its column. I don't need to preserve any special formatting. I'm willing to try a solution which removes all unprintable characters, for example.

I have a very strong preference for a solution which will work on Linux, possibly some kind of Okular plugin. But if there happens to be a Windows-only solution, I want to hear about it too. I have a license for a somewhat recent Acrobat Pro on the Windows machine.

I had a similar problem while I was working on a text-to-speech script a while ago.

My script would try to break up the text input into chunks by looking for newlines. With PDF files this would result in a mess, because every visual line ends with a newline. So what I did was compose a few sed and tr commands that only treat a newline following a full stop as an actual line break.
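The full-stop rule could look roughly like this in Python. This is only an illustration of the idea, not the original sed and tr commands; the function name `join_wrapped_lines` is made up, and the rule will misfire on lines that end in an abbreviation such as "e.g." or on sentences that end with "?" or "!".

```python
def join_wrapped_lines(text: str) -> str:
    """Join lines, keeping a break only after a line that ends with a full stop."""
    pieces = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines left over from the copy
        pieces.append(line)
        # A trailing full stop marks a real break; anything else is a wrap.
        pieces.append("\n" if line.endswith(".") else " ")
    return "".join(pieces).strip()


sample = "The cat sat on\nthe mat.\nIt then fell\nasleep."
print(join_wrapped_lines(sample))
# The cat sat on the mat.
# It then fell asleep.
```

For the spreadsheet use case, each output line then corresponds to one sentence instead of one visual line of the PDF.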

The sed/tr pipeline wasn't very pretty, but it worked. The script uses xsel to read the currently highlighted text and then modifies it with the sed and tr command line I mentioned above. The processed text is then passed back to the clipboard via xsel -bi.

This has been bugging me for years, so I figured out a general Windows solution using AutoHotkey. AutoHotkey is a lightweight, free, open-source scripting tool for Windows that lets you create hotkeys for almost anything imaginable.

In a PDF reader, it copies the selection, removes line breaks and double spaces, and puts the result into the clipboard. If nothing is selected, the clipboard is practically untouched. You can easily find the window class for your own software with the WinGetClass command. If you prefer to read PDFs in your browser, this is not your solution.

Another thing that worked for me was saving the PDF file as HTML. Other file formats work as well, such as txt or rtf. This should also work on Linux systems.

A third approach using macros is shown here, but I haven't tried it. I pasted the macros here for future reference; macro 2 is by the author of the source, "Deborah Savadra", and macro 1 by her reader "Benjamin".

There is a Windows solution shown here. I tried it out and it works just fine, except that it removes all line breaks, so if you copy multiple paragraphs you end up with only one. There is a related question on SU with a little explanation; it may be of interest to someone.

This bash script removes line breaks when copying text from PDF. It works for both the Primary Selection and the Clipboard on Linux.

I know this is an old question, however I felt it would be useful to answer it because no other solution was as easy to use as this one. Use the Linux app Okular to open your PDF file. Then select your text as if it were in table form.

If you have Acrobat, click in the text so that the cursor is blinking there. It won't work if you don't do that. Go to Advanced, Accessibility, Add tags.
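Several of the approaches above strip every line break, which merges multiple paragraphs into one. Here is a minimal Python sketch of a paragraph-preserving variant, assuming that paragraphs in the copied text are separated by at least one blank line. Text copied from a PDF often has no blank lines, in which case everything still collapses into a single paragraph. The function name `unwrap_paragraphs` is made up for illustration and is not taken from any of the scripts mentioned.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Flatten line breaks and repeated spaces inside paragraphs, keep paragraph breaks."""
    # Normalise Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for paragraph in paragraphs:
        # Any run of whitespace, including single line breaks, becomes one space.
        flat = re.sub(r"\s+", " ", paragraph).strip()
        if flat:
            cleaned.append(flat)
    return "\n\n".join(cleaned)


copied = "First  line\nwraps here.\n\nSecond\r\nparagraph."
print(unwrap_paragraphs(copied))
# First line wraps here.
#
# Second paragraph.
```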

Is there an efficient way to copy text from a PDF without the line breaks?

Asked 6 years, 7 months ago. Active 1 year, 6 months ago. Viewed 22k times.

Did you try with foxit reader? See linuxquestions.

Kasun: FoxitReader or whatever reader one uses is irrelevant: the pdf file is the one that introduces the linebreaks.

Using this snippet I wrote a small script for you that I hope will help.

Glutanimate

Windows MiCl It still works at my end. Did you change anything? Like updating your reader? On the other hand, who knows what was updated by Win

Quasimodo

macro 1:

    Sub pagebreaks
    '
    ' pagebreaks Macro
    '
    Selection. ClearFormatting
    Selection. ClearFormatting
    With Selection.

It will be easier to vote on them individually that way.

Based on Glutanimate's script. Make sure that the script and clipnotify (downloaded or precompiled) are in the same folder. Line breaks will be removed.

SidMan

Arvanitis Christos

This answer should make it closer to the top, as it is more relevant to the question than other answers i.

Just tried this and I still had the line endings when I used paste special and selected unformatted text. Maybe things have changed. Okular is version 0.

Slightly faffy, but once you get the shortcuts under your fingers it's much quicker.

Sunner

Copy and paste is not reliable; that's the entire point of the question. If one wants to clean up by search and replace, they would first convert to text with pdftotext and then use any text editor they like with standard regex.
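The pdftotext-then-regex route from that last comment can also be scripted. Below is a rough Python sketch, assuming the pdftotext command-line tool is installed and on the PATH; the function names and the two regular expressions are my own choices, not something taken from the thread. The hyphen rule is a blunt heuristic: it also joins genuinely hyphenated words that happen to break at the hyphen ("well-" + "known" becomes "wellknown").

```python
import re
import subprocess


def unwrap(text: str) -> str:
    """Undo hard line wraps in extracted text while keeping blank-line paragraph breaks."""
    # Rejoin words hyphenated across a line break: "exam-\nple" -> "example".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace a lone newline (one that is not part of a blank line) with a space.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)


def pdf_to_unwrapped_text(pdf_path: str) -> str:
    """Extract text with pdftotext and unwrap it (requires pdftotext on the PATH)."""
    # "-" as the output file asks pdftotext to write the text to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return unwrap(result.stdout)


print(unwrap("An exam-\nple of wrapped\ntext.\n\nNext paragraph."))
# An example of wrapped text.
#
# Next paragraph.
```

Calling `pdf_to_unwrapped_text("article.pdf")` (a placeholder file name) would return the whole document as text, so this suits bulk extraction more than picking out single snippets by hand.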

