What Else Programmers Do: A Text Manipulation Example

Sometimes, in my "regular" (i.e. non-IT) life, I apply my job skills. Today, for example, I came across a Facebook post about how to contact and influence Congress members. The post suggested cutting and pasting to share with others. Unfortunately, it wasn't possible to select the text. Some sites do this to prevent copying of their content, news sites especially, for copyright reasons.

What follows is not a tutorial for regular (i.e. non-technical-programming-nerdy) people. It's an example of the kind of process I do in my job every day, and how I think about a problem. It requires tools, knowledge and experience.

Getting to the "Source" of the Matter

Now, the first problem is that I viewed the original FB post on my phone, and it turns out that's why I couldn't copy/paste the text. If I'd bothered to find it on my desktop, my problem would have been solved. Instead, I emailed myself a link to the page. That link still went to the mobile version. That's what the "m." in the URL is telling me.

Here's some of the text:

I can't copy the text. It behaves sort of like an image, but I'm pretty sure it's not one. My next step is to view the page source HTML. I right-click in Chrome and choose "View page source".

That yields a huge mess of coded text, as befits a complex web site. There are about three thousand more lines like this.

I'm pretty sure buried in there is the text I want. And I'm also pretty sure I'll need to manipulate it. So, I open Notepad++, a sophisticated, free, open source text editor. Then I copy/paste the source from the browser into it.

Now I'm ready to use my text editor.

Text Editing is Not the Same as Word Processing

In word processing, like you do with Microsoft Word, the concern is text formatting: making things bold, italic, different fonts, etc. The result of all that work is marked up text (or binary, but we'll stick with text) that tells the application how to display all of your words on the screen. Here's an example from a Word file that was saved in its standard .docx format. The document literally has one word: "hello". But Word needs all of this information to display those five characters.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document 
xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" 
xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" 
xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" 
xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" 
xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" 
xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" 
xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" 
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" 
xmlns:o="urn:schemas-microsoft-com:office:office" 
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" 
xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" 
xmlns:v="urn:schemas-microsoft-com:vml" 
xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" 
xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" 
xmlns:w10="urn:schemas-microsoft-com:office:word" 
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" 
xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" 
xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" 
xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" 
xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" 
xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" 
xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" 
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 w16se wp14">
  <w:body>
    <w:p w:rsidR="000E45DD" w:rsidRDefault="00FA6736">
      <w:r>
        <w:t>hello</w:t>
      </w:r>
      <w:bookmarkStart w:id="0" w:name="_GoBack"/>
      <w:bookmarkEnd w:id="0"/>
    </w:p>
    <w:sectPr w:rsidR="000E45DD">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>

The concern of text editing is working with raw content like that above. A good text editor lets you perform complex Find & Replace, or do block editing (where you can select text not just in rows, but also in columns).
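To make the point about markup concrete, here is a minimal Python sketch (my own illustration, not something from the original workflow) that pulls the one visible word back out of that kind of XML. The sample string is a trimmed-down stand-in for the file above: I kept only the "w" namespace and dropped the rest, and the function name visible_text is made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the Word XML shown above. Only the "w"
# namespace is declared; everything else is omitted for brevity.
DOCX_XML = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:t>hello</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>"""

# Prefix-to-URI map so the search path can use the short "w:" prefix.
W_NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}


def visible_text(xml_source: str) -> str:
    """Join the contents of every <w:t> (text run) element in the XML."""
    root = ET.fromstring(xml_source)
    return "".join(t.text or "" for t in root.iterfind(".//w:t", W_NS))


print(visible_text(DOCX_XML))  # prints: hello
```

Everything except that one w:t element is formatting and bookkeeping, which is exactly the stuff a word processor cares about and a text editor just shows you as-is.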

Back to our FB post source. I know that somewhere in the text is the bit that I want. So, I search for some text that's very likely unique to the content I'm looking for. "Senator" is a good choice.

TECH DIGRESSION!

I found, at this point, why I couldn't copy/paste the text in the first place. All of the content was inside a <script> tag. It was probably being processed by JavaScript, and displayed in some messed-up way.
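I did the search by hand in Notepad++, but the same hunt can be sketched in Python. This is only an illustration under assumptions: the sample page below is invented (the real page was thousands of lines and I'm not reproducing it), and the ScriptFinder class is a hypothetical helper built on the standard library's html.parser module.

```python
from html.parser import HTMLParser


class ScriptFinder(HTMLParser):
    """Collect the body of every <script> element that mentions a keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword
        self.in_script = False
        self.chunks = []   # pieces of the script currently being read
        self.matches = []  # full script bodies that contain the keyword

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.chunks = []

    def handle_data(self, data):
        # Script content may arrive in more than one piece, so buffer it.
        if self.in_script:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_script:
            self.in_script = False
            body = "".join(self.chunks)
            if self.keyword in body:
                self.matches.append(body)


# Invented stand-in for the page source; the real page was far larger.
SAMPLE_PAGE = r"""<html><body>
<div id="placeholder"></div>
<script>render({"html": "\u003Cp>From a high-level staffer for a Senator:\u003C\/p>"});</script>
</body></html>"""

finder = ScriptFinder("Senator")
finder.feed(SAMPLE_PAGE)
finder.close()
print(len(finder.matches))  # prints: 1
```

The keyword search is the same idea as typing "Senator" into the editor's Find box. The parser just adds the knowledge of which tag the hit was sitting in.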

There's other markup in there, but I'll take care of that in a little bit. First, I copy what looks like the text I want into a new Notepad++ tab.

I Need a Replacement

Now, I look at what's there, while thinking about what I want. Ideally, my end result is nicely formatted text, meaning it's got all the original line breaks. I could do that manually, but I'd prefer not to.

The first thing I notice is the markup after "Senator."

From a high-level staffer for a Senator:\u003C\/p>\u003Cp> There are two things

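That sequence, \u003C\/p>\u003Cp>, appears throughout the text exactly where line breaks are in the original. (\u003C is an escaped "<" and \/ is an escaped "/", so it's really </p><p>: the end of one HTML paragraph and the start of the next.)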

So let's see what happens if I just replace all of those instances with a new line and carriage return.

TECH DIGRESSION!

Huh? How (you wonder) do I do that? How do I insert something that can't be seen? Well, even though you can't see them, line breaks (where you go down one line) and carriage returns (where you go back to the beginning of the line) are stored in text just like any other character. They just aren't displayed in the editor. But there are ways to represent them, and one long-standing way is with C programming-style expressions. I can use text to represent other, hidden text.

\n = newline
\r = carriage return
\t = tab stop

The back slash is an "escape" character. It tells the application "don't treat this sequence \n literally. Instead, look in the raw text for the hexadecimal byte 0A." (For \r, the byte is 0D.) As an example, here are two lines of text:
life's
good
and its hex representation. I put the represented letters and carriage return/linefeed above the hex for clarity.
l  i  f  e  '  s  \r \n g  o  o  d
6c 69 66 65 27 73 0d 0a 67 6f 6f 64
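If you'd rather not take the hex on faith, a few lines of Python will print it. This is just a sketch for checking the bytes; I'm assuming Windows-style line endings (carriage return followed by newline) and plain ASCII text.

```python
# Two lines of text joined by a Windows-style line ending:
# carriage return (\r) followed by newline (\n).
text = "life's\r\ngood"

# Turn the characters into raw bytes, then show each byte as two hex digits.
raw = text.encode("ascii")
print(raw.hex(" "))  # prints: 6c 69 66 65 27 73 0d 0a 67 6f 6f 64
```

The 0d is the carriage return and the 0a is the newline, sitting in the data like any other character.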

So, I select \u003C\/p>\u003Cp> and try to replace it with \r\n\r\n (two carriage return + newline pairs). And it fails.

Why? For the same reason the newline can be found: the escape character \. Notepad++ is looking at my "Find what" string and saying, "Oh, you've got a back slash in front of that u. That must mean something special to me, so I won't treat it as '\u'. I can't find your text, sorry."

What do we do if we want the back slash to be treated literally? Heh, heh. You escape it by putting another back slash in front of it. To find a literal occurrence of \r in text using C-style expressions, you search for \\r. Crazy logical, right?
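Python string literals follow the same C-style convention, so they make a handy way to see the difference. This is an illustration in Python, not Notepad++ itself, and the variable names are made up.

```python
escaped = "\r"    # one character: an actual carriage return
literal = "\\r"   # two characters: a back slash followed by the letter r

print(len(escaped), len(literal))  # prints: 1 2
print(literal)                     # prints: \r

# Searching with the doubled back slash finds the literal two-character text.
sample = "type " + literal + " to mean carriage return"
print(sample.find("\\r"))          # prints: 5
```

One back slash means "the next character is special." Two back slashes mean "I really do want a back slash here."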

My search expression becomes this: \\u003C\\/p>\\u003Cp>

I try my Find & Replace again, and this time it works.
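For comparison, here is roughly the same Find & Replace as a Python sketch. It's an equivalent I'm offering for illustration, not what I actually ran. The sample is the snippet quoted earlier, and the name MARKER is my own.

```python
# The text to find, written with doubled back slashes so that each one
# is taken literally (same trick as in the search expression above).
MARKER = "\\u003C\\/p>\\u003Cp>"

# A raw string (r"...") keeps the back slashes in the sample exactly as typed.
sample = r"From a high-level staffer for a Senator:\u003C\/p>\u003Cp> There are two things"

# Swap every marker for two carriage return + newline pairs.
result = sample.replace(MARKER, "\r\n\r\n")
print(result)
```

After the replace, the sentence about the Senator and the one that follows it are separated by a blank line, just like in the original post.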

I could live with this result if I weren't a perfectionist. But there's still lots of HTML markup in there (highlighted in yellow). So, I do my Find & Replace magic a bunch more times. By looking at the original text, I can figure out what characters the markup stood for (or I could look it up). A couple of the sequences (the \u and \/ ones) aren't HTML at all. They're JavaScript-style escapes left over from the script the text was embedded in.

&#039;  =   '
&quot;  =   "
&amp;   =   &
&gt;    =   >
\u2013  =   --
\/      =   /
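Doing these one at a time in the editor works, but the table above can also be written down once and applied in a loop. Here's a hedged Python sketch of that idea. The clean function, the REPLACEMENTS name and the sample sentence are all made up for illustration, and the list only covers the sequences in my table, not every entity that exists.

```python
# Each pair is (text to find, text to put in its place), taken from the
# table above. "&amp;" goes last on purpose: if it ran first, a string
# like "&amp;gt;" would turn into "&gt;" and then wrongly into ">".
REPLACEMENTS = [
    ("&#039;", "'"),
    ("&quot;", '"'),
    ("&gt;", ">"),
    ("\\u2013", "--"),
    ("\\/", "/"),
    ("&amp;", "&"),
]


def clean(text: str) -> str:
    """Apply every replacement in order and return the cleaned text."""
    for find, replace_with in REPLACEMENTS:
        text = text.replace(find, replace_with)
    return text


# Invented sample containing one of each kind of leftover.
sample = r"2 Senators &amp; 1 Rep \u2013 it&#039;s &quot;urgent&quot; \/ --&gt; call"
print(clean(sample))  # prints: 2 Senators & 1 Rep -- it's "urgent" / --> call
```

The order of the list matters more than it looks, which is the kind of thing you only learn after a replacement quietly mangles your text.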

The result is the text as I originally saw it on the post. Here's a partial example.

From a high-level staffer for a Senator:

 There are two things that all Democrats should be doing all the time right now, and they're by far the 
 most important things. [***If you want to share this, please copy and paste so it goes beyond our mutual
 friends***]

 --> You should NOT be bothering with online petitions or emailing.

 1. The best thing you can do to be heard and get your congressperson to pay attention is to have 
face-to-face time - if they have townhalls, go to them. Go to their local offices. If you're in DC, try
 to find a way to go to an event of theirs. Go to the "mobile offices" that their staff hold periodically
 (all these times are located on each congressperson's website). 

 When you go, ask questions. A lot of them. And push for answers. The louder and more vocal and present
 you can be at those the better.

 2. But, those in-person events don't happen every day. So, the absolute most important thing that people
 should be doing every day is calling. 

 You should make 6 calls a day: 2 each (DC office and your local office) to your 2 Senators & your 1 
Representative.

Wrap Up

It took a lot to extract the text from that mobile web page.

Use the browser's View Source feature
Use a specialized text editor
Find the content I was looking for
Recreate the line endings via Find & Replace using C programming language expressions
Replace the remaining HTML markup

If you're wondering what kind of things that special programmer in your life is doing every day, this is it. It's not glamorous. In fact, it's pretty tedious. But, it's a living!