Sassafras4u: Science Article SpreadSheets And Fast Reference Archive System

“Cleaned” Text

To make reading faster and more enjoyable, I first (before I even begin reading) remove text that is not directly related to the article itself, which I’ve come to call “garbage text.” This includes advertising, unrelated links, pull-quotes, and extra spaces and paragraph returns. I also change “special characters” (characters that appear differently on Macs and PCs, including foreign-accented or ligature characters, “curly” quotes, bullets, ellipses, dashes, fractions, etc.) into “universal typewriter” characters. This ensures that the text files will be readable by any computer or digital device, not only now but, more importantly, far into the future.
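For readers without access to SuperReplace, the character substitutions described above can be roughly sketched in Python. This is only an illustration under stated assumptions: the mapping table `REPLACEMENTS` and the function name `to_typewriter` are hypothetical, and the actual SuperReplace rules are not shown on this page.

```python
import re
import unicodedata

# Hypothetical mapping of "special characters" to plain typewriter characters.
# The real SuperReplace filter is far larger and is not reproduced here.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",      # curly single quotes
    "\u201c": '"', "\u201d": '"',      # curly double quotes
    "\u2013": "-", "\u2014": "--",     # en dash, em dash
    "\u2026": "...",                   # ellipsis
    "\u2022": "*",                     # bullet
    "\u00bc": "1/4", "\u00bd": "1/2", "\u00be": "3/4",  # fractions
    "\ufb01": "fi", "\ufb02": "fl",    # ligatures
    "\u00a0": " ",                     # non-breaking space
}

def to_typewriter(text: str) -> str:
    """Return text with common special characters replaced by plain ones."""
    for special, plain in REPLACEMENTS.items():
        text = text.replace(special, plain)
    # Split accented letters into base letter + accent, then drop the accent.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse runs of spaces/tabs and runs of three or more line breaks.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text
```

Characters that have no decomposition (some non-Latin letters, for example) pass through unchanged, so a real filter would need many more rules than this sketch.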

This is done automatically and very quickly using a long-obsolete but quite amazing drag-and-drop application called SuperReplace (SR); see below.

Let me emphasize that the very short time it takes SR to “clean” 100-375+ articles, all at once in one quick drag-and-drop motion, is insignificant compared to the many hours, days or months it has undoubtedly saved me so far over reading these same articles on their original webpages, or from raw text copied straight from those webpages.

That is, it takes me much less time to speed-read the cleaned text files than it does to read the articles from their original sources (see Purpose and Text Viewing Options)—although I like to keep the original webpage visible simultaneously while I’m reading the text, in order that the layout gives me a feel for the original intent of the writer or editor, and I can view the pertinent graphics concurrently.

The cleaned articles are saved as styled text files. Download the article “Data For the Next Generations” (#121 on 11/12/07 spreadsheet), which adds strong support for styled text as my archive format of choice.

As of December 2010, especially for the convenience of PC users, the text files are now made available as Microsoft Word RTF (rich text format) files, in order to preserve the easy-to-read 2-column layout, font families, relative point sizes, styles and colors that Mac users have been able to enjoy for many years with the styled text files using the free text viewer Tofu.

The screenshots below show the original text on the left—with the text operated upon colored red for clarity—and the “cleaned” text on the right.

The text visible on the left (down to the sassafras leaf on left) is shown cleaned on the right (down to the sassafras leaf).

If nothing else, quite a lot of scrolling has been eliminated, but, more importantly, I find that reading the cleaned text is much less fatiguing and clearly far more satisfying.

[Screenshots: original text compared with cleaned text]

This even-more-reduced example shows how almost four pages of original text have been reduced to one easy-to-read “cleaned” page.

Many more examples are shown in the PDF files from which these screenshots come. Download them here (Zip 189K).

The PDF original.pdf is a compilation of excerpts from several articles selected to illustrate the wide variety of “garbage text” that has been removed or changed to make the text files more enjoyable to read.

In the PDF original-redhilite.pdf, the garbage text has been colored red for clarity, as shown in the screenshot to the left.

Notice, however, that it is much more difficult, even annoying, to read text from the original (without red highlighting) because you have to work harder to tell the garbage text from the main article text.

The cleaned text is in cleaned.pdf. If possible, keep original.pdf and cleaned.pdf visible side by side on your screen to fully appreciate the difference between the original text as copied from the webpage and the cleaned text.

* If you choose to have the text spoken (audibly by your computer) to you (using Tofu or other word-processing speech services), you will quickly appreciate not having to listen to tons of irrelevant “garbage text,” while having to pay extra attention to make sure you don’t miss any of the “real” article text.

SuperReplace

The application I use to “clean” the text is SuperReplace, which, unfortunately, is long obsolete and can only be used on my old Mac (running OS 7.5.1), or using OS X 10.4 Tiger’s “Classic” (OS 9.2) layer.

SuperReplace (SR) is an amazing drag-and-drop search-and-replace application that is able to sequentially search for and replace text strings that I specify. At present there are over 5,600 situations that I ask SR to look for and rearrange or modify in some way. I have been searching for a modern equivalent of SR but so far have had no success.

I’ve found three unique advantages of SR’s filter-writing syntax over other similar applications:

(1) You can use very powerful “wild cards” to specify your search.

(2) You can use text strings, or parts of text strings, that you’ve “found” (in the first step of the process) to designate what to “replace” it with. The benefit of this most powerful feature doesn’t become clear immediately and is even difficult to describe but it seems to be lacking in the similar applications that I’ve examined so far.

(3) SR is drag-and-drop, which means, for example, that I can drop all 100-375+ articles for the day on the SR “filter” and it will operate on them all at once. It takes the first, applies the first of the search-and-replace commands to the whole text article, applies the second command, etc., until it has gone through the 5,600+ commands I have so far. It then writes a new “cleaned” plain text file, leaving the original file untouched.

It then does the same for the second file, and so on, until it’s gone through all, e.g., 275 text files and produced 275 new cleaned files.
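The three features above (wild cards, reusing “found” text in the replacement, and batch processing that leaves the originals untouched) can be approximated with regular expressions in Python. The sketch below is not SuperReplace’s syntax or behavior in detail; the three rules in `RULES` and the function names `clean_text` and `clean_files` are hypothetical and stand in for the 5,600+ real commands.

```python
import re
from pathlib import Path

# Hypothetical rules, applied strictly in order, one after another.
RULES = [
    # Wild card: drop any whole line that starts with "Advertisement".
    (re.compile(r"^Advertisement.*\n", re.MULTILINE), ""),
    # Reuse found text: rearrange the captured year, month and day.
    (re.compile(r"Posted on (\d{4})-(\d{2})-(\d{2})"), r"Date: \2/\3/\1"),
    # Collapse runs of two or more spaces or tabs into one space.
    (re.compile(r"[ \t]{2,}"), " "),
]

def clean_text(text, rules=RULES):
    """Apply every rule in sequence; return (cleaned text, replacements made)."""
    total = 0
    for pattern, replacement in rules:
        text, count = pattern.subn(replacement, text)
        total += count
    return text, total

def clean_files(paths, out_dir):
    """Clean each file into out_dir, leaving the original files untouched."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    totals = {}
    for path in map(Path, paths):
        cleaned, count = clean_text(path.read_text(encoding="utf-8"))
        (out_dir / path.name).write_text(cleaned, encoding="utf-8")
        totals[path.name] = count
    return totals
```

Because each rule sees the output of the rule before it, the order of the list matters, just as it does when SR works through its commands sequentially.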

To illustrate the significance of this: the example file original.pdf (above) was cleaned by SR, resulting in cleaned.pdf, in a mere 7 seconds! In comparison, there’s very little I could have accomplished manually in 7 seconds to clean original.pdf of garbage text.

If anyone out there knows how to get in touch with the creator of SuperReplace, Guoniu Han, who resided in France when SR was available for sale, please let me know. You will be doing the world a great favor if I can contact him and convince him to rewrite SR for Mac OS X, Windows and Linux. There is still nothing that equals this decade-old utility!

October 14, 2008 Update—

I was able to actually find and exchange messages with the creator of SuperReplace, Dr. Han. He is working hard as a professor of mathematics in France, has switched over to Windows, and with two young daughters, doesn’t have any time or energy to be able to update SR any further.

However, I was very pleased that I was able to find him in order to thank him for his great contribution and to inform him that I was still making daily—and critical—use of his incredibly useful software.

A personal note: my daughter is now working for an architecture firm in Paris, not too far from Dr. Han. Perhaps someday I’ll be able to thank him in person.


This SR result dialog box shows that SR cleaned the text file “original,” as above, finding 26,588 sequential situations or “Occurrences” (misspelled) that I’ve asked it to search for and modify, in a mere 7 seconds (about 3,798 situations per second), much faster than I could manually do anything close to this!

“Pre-cleaning” with QuicKeys and Mariner

Many situations occur frequently enough that I find it convenient to clean them up immediately using the macro utility QuicKeys to have my word processor Mariner Write perform a sequence of steps automatically—by typing a single keystroke.

Download examples demonstrating the difference between the text as copied from the original webpage and what I am instantly able to accomplish using QuicKeys.
 

Original “Uncleaned” Files

You may notice files similar to the others in the text archives (in the “original text examples” folder) but with the suffix “.0” (before the .txt suffix): e.g., 11.txt and 11.0.txt. The .0 files contain the original text as copied from the original article’s Web page. I have retained a sample of these over the years for comparison purposes.

Files with the suffix .0notes contain added notations to point out situations that may not be visibly obvious, such as ''[2 apostrophes].

As of late 2008, I have placed these original text files in a separate folder (in addition to the “all” and “highlights” folders) entitled “original text examples” or “original raw text examples.”
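To compare an original file with its cleaned counterpart, the naming convention above (for example, 11.0.txt beside 11.txt) can be used to pair them up. The following Python sketch assumes both versions sit in the same folder; the function name `pair_originals` is hypothetical, and .0notes files are deliberately ignored.

```python
from pathlib import Path

def pair_originals(folder):
    """Return (original, cleaned) file-name pairs such as ("11.0.txt", "11.txt")."""
    folder = Path(folder)
    suffix = ".0.txt"
    pairs = []
    for original in sorted(folder.glob("*" + suffix)):
        # Strip the ".0.txt" ending and add ".txt" to get the cleaned name.
        cleaned = original.with_name(original.name[: -len(suffix)] + ".txt")
        if cleaned.exists():
            pairs.append((original.name, cleaned.name))
    return pairs
```

Each pair can then be opened side by side, much like original.pdf and cleaned.pdf above.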
 

Cleaned Text Files on this Site

The cleaned text files I’ve created for myself are thus of much greater value than it may first appear, since they represent the ideal format for my personal archive of articles.

As mentioned elsewhere, if you’re not convinced of the time and effort that SuperReplace has saved me, try it yourself: use one day’s spreadsheet to view the full day’s collection of articles and copy the text of each one. Then clean out the garbage text manually and save the cleaned text as a plain text document.

I’ve found that it’s much easier to trash the files I don’t want later than to decide immediately, before reading a web article more closely, whether it’s worth saving as a text file in my personal archive.