Cleaned Text
To make reading much faster and more enjoyable, I first (i.e., before even attempting to begin reading) remove text that is not directly related to the article itself, which I've come to refer to as "garbage text." This includes advertising, unrelated links, pull-quotes, and extra spaces and paragraph returns. I also change special characters (keystrokes that appear differently on Macs and PCs, including foreign-accented or ligature characters, curly quotes, bullets, ellipses, dashes, fractions, etc.) into universal typewriter characters. This ensures that the text files will be readable by any computer or digital device, not only now but, even more importantly, very far into the foreseeable future. Almost magically, this is done automatically and very quickly using a long-obsolete and quite amazing drag-and-drop application called SuperReplace (SR); see below.
Let me emphasize that the very short time it takes SR to dramatically clean 100-375+ articles (all at once, in one quick drag-and-drop motion) is insignificant compared to the many minutes, hours, days or months that it has undoubtedly saved me thus far, compared to reading these same articles from their original webpages or from raw text copied straight from those webpages. That is, it takes me much less time to speed-read the cleaned text files than it does to read the articles from their original sources (see Purpose and Text Viewing Options). I do like to keep the original webpage visible while I'm reading the text, so that the layout gives me a feel for the original intent of the writer or editor, and so that I can view the pertinent graphics concurrently.
The cleaned articles are saved as styled text files. Download the article "Data For the Next Generations" (#121 on 11/12/07 spreadsheet), which adds strong support for styled text as my archive format of choice.
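For readers without SuperReplace, the special-character step can be roughly approximated in Python. This is only a minimal sketch, not the actual SR filter: the mapping table, the function name to_typewriter, and the choice of replacements (for example "--" for a long dash) are all assumptions made for illustration.

```python
import re
import unicodedata

# Hypothetical mapping from "special" characters to typewriter characters.
# The real SuperReplace rules are not shown here; this table is illustrative.
_TABLE = str.maketrans({
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2013": "-", "\u2014": "--",   # en dash, em dash
    "\u2026": "...",                 # ellipsis
    "\u2022": "*",                   # bullet
    "\u00bc": "1/4", "\u00bd": "1/2", "\u00be": "3/4",  # fractions
    "\ufb01": "fi", "\ufb02": "fl",  # ligatures
})

def to_typewriter(text):
    """Return text with common special characters replaced by plain ones."""
    text = text.translate(_TABLE)
    # Decompose accented letters, then drop the combining accent marks.
    # Letters with no decomposition (such as the German sharp s) pass through.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse runs of spaces/tabs and runs of blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text
```

The explicit table runs before the Unicode decomposition so that fractions come out with an ordinary slash rather than the special fraction-slash character that decomposition would otherwise produce.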
As of December 2010, especially for the convenience of PC users, the text files are also made available as Microsoft Word RTF (rich text format) files, in order to preserve the easy-to-read 2-column layout, font families, relative point sizes, styles and colors that Mac users have been able to enjoy for many years with the styled text files using the free text viewer Tofu.
The screenshots below show the original text on the left (with the text operated upon colored red for clarity) and the cleaned text on the right. The text visible on the left, down to the sassafras leaf, is shown cleaned on the right, again down to the sassafras leaf. If nothing else, quite a lot of scrolling has been eliminated, but, more importantly, I find that reading the cleaned text is much less fatiguing and far more satisfying.
This even-more-reduced example shows how almost four pages of original text have been reduced to one easy-to-read cleaned page. Many more examples are shown in the PDF files from which these screenshots come. Download them here (Zip 189K).
The PDF "original.pdf" is a compilation of excerpts from several articles, selected to illustrate the wide variety of garbage text that has been removed or changed to make the text files more enjoyable to read. In the PDF "original-redhilite.pdf," the garbage text has been colored red for clarity, as shown in the screenshot to the left. Notice, however, that it is much more difficult (even annoying) to read text from the original without red highlighting, because you have to work harder to differentiate between the garbage and the main article text. The cleaned text is in "cleaned.pdf." If possible, keep "original.pdf" and "cleaned.pdf" visible side-by-side on your screen to fully appreciate the difference between the original text as copied from the webpage and the cleaned text.
* If you choose to have the text spoken aloud to you by your computer (using Tofu or other word-processing speech services), you will quickly appreciate not having to listen to tons of irrelevant garbage text while paying extra attention to make sure you don't miss any of the real article text.
SuperReplace
The application I use to clean the text is SuperReplace, which, unfortunately, is long obsolete and can only be used on my old Mac (running OS 7.5.1) or under OS X 10.4 Tiger's Classic (OS 9.2) layer. SuperReplace (SR) is an amazing drag-and-drop search-and-replace application that is able to sequentially search for and replace text strings that I specify. At present there are over 5,600 situations that I ask SR to look for and rearrange or modify in some way. I have been searching for a modern equivalent of SR but so far have had no success. I've found three unique advantages of SR's filter-writing syntax over other similar applications:
(1) You can use very powerful wild cards to specify your search.
(2) You can use text strings, or parts of text strings, that you've found (in the first step of the process) to designate what to replace them with. The benefit of this most powerful feature doesn't become clear immediately and is even difficult to describe, but it seems to be lacking in the similar applications that I've examined so far.
(3) SR is drag-and-drop, which means, for example, that I can drop all 100-375+ articles for the day on the SR filter and it will operate on them all at once.
SR takes the first file, applies the first of the search-and-replace commands to the whole text of the article, then applies the second command, and so on, until it has gone through the 5,600+ commands I have so far. It then writes a new cleaned plain text file, leaving the original file untouched. It then does the same for the second file, and so on, until it has gone through all of the files (e.g., 275 text files) and produced the same number of new cleaned files. To emphasize the significance of this: the example file "original.pdf" (above) was cleaned by SR, resulting in "cleaned.pdf," in a mere 7 seconds! In comparison, there's very little I could've accomplished manually in 7 seconds to clean "original.pdf" of garbage text.
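The process described above (an ordered list of rules, applied one after another to each file, with a new cleaned file written alongside the untouched original) can be sketched in Python with regular expressions. This is not SR's own filter syntax, which is not reproduced here. The four rules, the function names clean_text and clean_files, and the ".cleaned.txt" output suffix are all hypothetical, chosen only to show wild cards and the reuse of found text in the replacement.

```python
import re
from pathlib import Path

# Hypothetical rules, applied strictly in order. The real filter has 5,600+.
RULES = [
    # Wild-card style match: drop any line that is just "Advertisement".
    (re.compile(r"^Advertisement[ \t]*\n", re.MULTILINE), ""),
    # Reuse found text: "By Doe, Jane" becomes "By Jane Doe".
    (re.compile(r"\bBy ([A-Z][a-z]+), ([A-Z][a-z]+)"), r"By \2 \1"),
    # Collapse extra spaces and extra paragraph returns.
    (re.compile(r"[ \t]{2,}"), " "),
    (re.compile(r"\n{3,}"), "\n\n"),
]

def clean_text(text, rules=RULES):
    """Apply every rule, in order, to the whole text."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

def clean_files(paths, suffix=".cleaned.txt"):
    """Write a cleaned copy next to each file; originals are not modified."""
    written = []
    for path in map(Path, paths):
        cleaned = clean_text(path.read_text(encoding="utf-8"))
        out = path.with_name(path.stem + suffix)
        out.write_text(cleaned, encoding="utf-8")
        written.append(out)
    return written
```

Calling clean_files on a folder's worth of paths, for example clean_files(Path("articles").glob("*.txt")) with "articles" as a placeholder folder name, would be the rough equivalent of dropping the day's files on the filter. Because each rule sees the output of the previous one, the order of the rules matters, just as it does with a sequential SR filter.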
If anyone out there knows how to get in touch with the creator of SuperReplace, Guoniu Han, who resided in France when SR was available for sale, please let me know. You will be doing the world a great favor if I can contact him and convince him to rewrite SR for Mac OS X, Windows and Linux. There is still nothing that equals this decade-old utility!
October 14, 2008 Update
I was able to find and exchange messages with the creator of SuperReplace, Dr. Han. He is working hard as a professor of mathematics in France, has switched over to Windows, and, with two young daughters, doesn't have the time or energy to update SR any further. However, I was very pleased to have found him, in order to thank him for his great contribution and to inform him that I am still making daily (and critical) use of his incredibly useful software. A personal note: my daughter is now working for an architecture firm in Paris, not too far from Dr. Han. Perhaps someday I'll be able to thank him in person.
[Screenshot: SR result dialog box]
Pre-cleaning with QuicKeys and Mariner
Many situations occur frequently enough that I find it convenient to clean them up immediately, using the macro utility QuicKeys to have my word processor, Mariner Write, perform a sequence of steps automatically when I type a single keystroke. Download examples demonstrating the difference between the text as copied from the original webpage and what I am instantly able to accomplish using QuicKeys.
Original Uncleaned Files
You may notice files similar to the others in the text archives (in the "original text examples" folder) but with the suffix ".0" before the ".txt" suffix: e.g., "11.txt" and "11.0.txt". The ".0" files contain the original text as copied from the original article's Web page. I have retained a sample of these over the years for comparison purposes. Files with the suffix ".0notes" contain added notations to point out situations that may not be visibly obvious, such as ''[2 apostrophes]. As of late 2008, I have placed these original text files in a separate folder (in addition to the "all" and "highlights" folders) entitled "original text examples" or "original raw text examples."
Cleaned Text Files on this Site
The cleaned text files I've created for myself are thus of much greater value than it may first appear, since they represent the ideal format for my personal archive of articles. As mentioned elsewhere, if you're not convinced of the time and effort that using SuperReplace has saved me, try using one day's spreadsheet to view the full day's collection of articles, copy the text yourself, clean it manually of garbage text, and save the cleaned text as a plain text document. I've found that it's much easier to trash the files I don't want to keep later than to have to decide immediately, before having read an article more closely, whether it's worth saving as a text file in my personal archive.