Thursday, January 26, 2017

What Else Programmers Do: A Text Manipulation Example

Sometimes, in my "regular" (i.e. non-IT) life, I apply my job skills. Today, for example, I came across a Facebook post about how to contact and influence Congress members. The post suggested cutting and pasting to share with others. Unfortunately, it wasn't possible to select the text. Some sites do this to prevent copying of their content, news sites especially, for copyright reasons.

What follows is not a tutorial for regular (i.e. non-technical-programming-nerdy) people. It's an example of the kind of process I do in my job every day, and how I think about a problem. It requires tools, knowledge and experience.

Getting to the "Source" of the Matter

Now, the first problem is that I viewed the original FB post on my phone, and it turns out that's why I couldn't copy/paste the text. If I'd bothered to find it on my desktop, my problem would have been solved. Instead, I emailed myself a link to the page. That link still went to the mobile version. That's what the "m." in the URL is telling me.

[Screenshot: 2017-01-26 09_35_13]

Here's some of the text:

[Screenshot: 2017-01-26 09_38_16]

I can't copy the text. It behaves sort of like an image, but I'm pretty sure it isn't one. My next step is to view the page source HTML. I right-click in Chrome and choose "View page source".

[Screenshot: 2017-01-26 09_40_46]

That yields a huge mess of coded text, as befits a complex web site. There are about three thousand more lines like this.

[Screenshot: 2017-01-26 09_42_55]

I'm pretty sure buried in there is the text I want. And I'm also pretty sure I'll need to manipulate it. So, I open Notepad++, a sophisticated, free, open source text editor. Then I copy/paste the source from the browser into it.

[Screenshot: 2017-01-26 09_46_56]

Now I'm ready to use my text editor.

Text Editing is Not the Same as Word Processing

In word processing, like you do with Microsoft Word, the concern is text formatting: making things bold, italic, different fonts, etc. The result of all that work is marked-up text (or binary, but we'll stick with text) that tells the application how to display all of your words on the screen. Here's an example from a Word file that was saved in its standard .docx format. The document literally has one word: "hello". But Word needs all of this information to display those five characters.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document 
xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" 
xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" 
xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" 
xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" 
xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" 
xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" 
xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" 
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" 
xmlns:o="urn:schemas-microsoft-com:office:office" 
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" 
xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" 
xmlns:v="urn:schemas-microsoft-com:vml" 
xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" 
xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" 
xmlns:w10="urn:schemas-microsoft-com:office:word" 
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" 
xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" 
xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" 
xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" 
xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" 
xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" 
xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" 
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 w16se wp14">
  <w:body>
    <w:p w:rsidR="000E45DD" w:rsidRDefault="00FA6736">
      <w:r>
        <w:t>hello</w:t>
      </w:r>
      <w:bookmarkStart w:id="0" w:name="_GoBack"/>
      <w:bookmarkEnd w:id="0"/>
    </w:p>
    <w:sectPr w:rsidR="000E45DD">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>

The concern of text editing is working with raw content like that above. A good text editor lets you perform complex Find & Replace, or do block editing (where you can select text not just in rows, but also in columns).

Back to our FB post source. I know that somewhere in the text is the bit that I want. So, I search for some text that's very likely unique to the content I'm looking for. "Senator" is a good choice.
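The same keyword search can be scripted. Here is a minimal sketch in Python. The helper name find_lines and the tiny page_source sample are made up for illustration and stand in for the real three thousand lines.

```python
def find_lines(source, keyword):
    """Return (line_number, line) pairs for every line containing keyword."""
    return [
        (number, line)
        for number, line in enumerate(source.splitlines(), start=1)
        if keyword in line
    ]


# Hypothetical stand-in for the real page source.
page_source = "\n".join([
    "<html><head><title>Post</title></head>",
    "<script>var x = 1;</script>",
    r'<script>{"text":"From a high-level staffer for a Senator:\u003C\/p>"}</script>',
    "</html>",
])

hits = find_lines(page_source, "Senator")
for number, line in hits:
    # Print only the start of each hit; real source lines can be very long.
    print(number, line[:60])
```

A word that is rare in markup but certain to be in the content keeps the list of hits short.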

[Screenshot: 2017-01-26 10_00_34]

TECH DIGRESSION!

I found, at this point, why I couldn't copy/paste the text in the first place. All of the content was inside a <script> tag. It was probably being processed by JavaScript, and displayed in some messed-up way.

There's other markup in there, but I'll take care of that in a little bit. First, I copy what looks like the text I want into a new Notepad++ tab.

[Screenshot: 2017-01-26 10_06_45]

I Need a Replacement

Now, I look at what's there, while thinking about what I want. Ideally, my end result is nicely formatted text, meaning it's got all the original line breaks. I could do that manually, but I'd prefer not to.

The first thing I notice is the markup after "Senator."

From a high-level staffer for a Senator:\u003C\/p>\u003Cp> There are two things 

That sequence, \u003C\/p>\u003Cp> appears throughout the text exactly where line breaks are in the original.

[Screenshot: 2017-01-26 10_13_55]

So let's see what happens if I just replace all of those instances with a new line and carriage return.

TECH DIGRESSION!

Huh? How (you wonder) do I do that? How do I insert something that can't be seen? Well, even though you can't see them, line breaks (where you go down one line) and carriage returns (where you go back to the beginning of the line) are stored in text just like any other character. They just aren't displayed in the editor. But there are ways to represent them, and one long-standing way is with C programming-style expressions. I can use text to represent other, hidden text.

 

\n = newline
\r = carriage return
\t = tab stop

 

The backslash is an "escape" character. It tells the application "don't treat this sequence \n literally. Instead, look in the raw text for the byte with the hexadecimal value 0A." (Likewise, \r stands for the byte 0D.) As an example, here are two lines of text:

life's
good

and its hex representation. I put the represented letters and carriage return/line feed above the hex for clarity.

l  i  f  e  '  s  \r \n g  o  o  d
6c 69 66 65 27 73 0d 0a 67 6f 6f 64
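The byte values can be checked with a few lines of Python. This is only a sketch: it assumes the two lines are joined by a Windows-style line ending (carriage return, then line feed) and uses bytes.hex with a separator, which needs Python 3.8 or later.

```python
# \r\n is the carriage return + line feed pair between the two lines.
text = "life's\r\ngood"

# Encode to bytes, then show each byte as two hex digits separated by spaces.
hex_bytes = text.encode("ascii").hex(" ")
print(hex_bytes)  # 6c 69 66 65 27 73 0d 0a 67 6f 6f 64
```

The carriage return shows up as 0d and the line feed as 0a, in that order.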

So, I select \u003C\/p>\u003Cp> and try to replace it with \r\n\r\n (two carriage return + new line). And it fails.

[Screenshot: 2017-01-26 10_32_08]

Why? For the same reason the newline can be found: the escape character \. Notepad++ is looking at my "Find what" string and saying, "Oh, you've got a backslash in front of that u. That must mean something special to me, so I won't treat it as '\u'. I can't find your text, sorry."

What do we do if we want the backslash to be treated literally? Heh, heh. You escape it by putting another backslash in front of it. To find a literal occurrence of \r in text using C-style expressions, you search for \\r. Crazy logical, right?

My search expression becomes this: \\u003C\\/p>\\u003Cp>

I try my Find & Replace again, and this time it works.
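The same replacement can be sketched in Python, which uses the same C-style escapes. The sample is the short fragment quoted earlier; the names MARKER, raw and cleaned are made up for illustration.

```python
# A raw string (r"...") keeps every backslash literal, which is what the
# doubled backslashes achieve in the editor's search box.
MARKER = r"\u003C\/p>\u003Cp>"

# Short sample taken from the copied source text.
raw = r"From a high-level staffer for a Senator:\u003C\/p>\u003Cp> There are two things"

# Swap each marker for two carriage return + line feed pairs,
# which leaves a blank line between paragraphs.
cleaned = raw.replace(MARKER, "\r\n\r\n")
print(cleaned)

# The doubled-backslash spelling names exactly the same text as the raw string.
assert MARKER == "\\u003C\\/p>\\u003Cp>"
```

Without the r prefix or the doubled backslashes, Python would read \u003C as the single character < and the search would miss, which is the same kind of failure described above.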

[Screenshot: 2017-01-26 10_39_04]

I could live with this result if I weren't a perfectionist. But there's still lots of HTML markup in there (highlighted in yellow). So, I do my Find & Replace magic a bunch more times. By looking at the original text, I can figure out what characters the markup stood for (or I could look it up). There's also some non-standard markup.

&#039;  =   '
&quot;  =   "
&amp;   =   &
&gt;    =   >
\u2013  =   --
\/      =   /
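Repeating Find & Replace by hand amounts to walking through a lookup table. A rough Python sketch follows, with a made-up sample string and a hypothetical helper name, apply_replacements. One choice here is mine rather than the post's: &amp; is replaced last, so that text such as &amp;gt; is not decoded twice.

```python
# Replacement table copied from the list above; &amp; is deliberately last.
REPLACEMENTS = {
    "&#039;": "'",
    "&quot;": '"',
    "&gt;": ">",
    r"\u2013": "--",
    r"\/": "/",
    "&amp;": "&",
}


def apply_replacements(text, table):
    """Apply each literal replacement in the table's insertion order."""
    for old, new in table.items():
        text = text.replace(old, new)
    return text


# Made-up sample that contains every kind of markup in the table.
sample = r"they&#039;re --&gt; the &quot;mobile offices&quot;: 2 Senators &amp; 1 Rep \u2013 DC\/local"
result = apply_replacements(sample, REPLACEMENTS)
print(result)
```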

The result is the text as I originally saw it on the post. Here's a partial example.


From a high-level staffer for a Senator:

 There are two things that all Democrats should be doing all the time right now, and they're by far the 
 most important things. [***If you want to share this, please copy and paste so it goes beyond our mutual
 friends***]

 --> You should NOT be bothering with online petitions or emailing.

 1. The best thing you can do to be heard and get your congressperson to pay attention is to have 
face-to-face time - if they have townhalls, go to them. Go to their local offices. If you're in DC, try
 to find a way to go to an event of theirs. Go to the "mobile offices" that their staff hold periodically
 (all these times are located on each congressperson's website). 

 When you go, ask questions. A lot of them. And push for answers. The louder and more vocal and present
 you can be at those the better.

 2. But, those in-person events don't happen every day. So, the absolute most important thing that people
 should be doing every day is calling. 

 You should make 6 calls a day: 2 each (DC office and your local office) to your 2 Senators & your 1 
Representative.

Wrap Up

It took a lot to extract the text from that mobile web page.

  1. Use the browser's View Source feature
  2. Use a specialized text editor
  3. Find the content I was looking for
  4. Recreate the line endings via Find & Replace using C programming language expressions
  5. Replace the remaining HTML markup
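
For comparison, steps 4 and 5 could probably be collapsed with decoders from Python's standard library. This is a swapped-in technique, not what I did above. It assumes the copied fragment is the inside of a JSON-style string literal, which the \u003C and \/ escapes suggest but which I haven't confirmed. The function name extract_text and the sample are made up for illustration.

```python
import html
import json


def extract_text(fragment):
    """Decode a JSON-escaped HTML fragment into plain text."""
    # Assumption: fragment is the inside of a JSON string literal
    # and contains no raw double quotes.
    decoded = json.loads('"' + fragment + '"')  # \u003C -> <, \/ -> /
    # Paragraph boundaries become blank lines, as in the manual Find & Replace.
    decoded = decoded.replace("</p><p>", "\r\n\r\n")
    # Entities such as &#039; &quot; &amp; &gt; become plain characters.
    return html.unescape(decoded)


sample = r"From a high-level staffer for a Senator:\u003C\/p>\u003Cp> they&#039;re --&gt; calling"
print(extract_text(sample))
```

One difference from the manual route: json.loads turns \u2013 into a real en dash rather than the two hyphens I typed.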

If you're wondering what kind of things that special programmer in your life is doing every day, this is it. It's not glamorous. In fact, it's pretty tedious. But, it's a living!

Wednesday, January 25, 2017

What does this code do? Extended explanation

Ayende posted some code that didn't work as expected, and an explanation of why. You can read his full post here.

What does this code do?

But I'm not as bright as his readers, and was still mighty confused. It took me about ten minutes to finally see what was going on.

Ayende's original code is this:

var doc = new Dictionary<string,object>
{
    ["@metadata"] = new Dictionary<string, object>
    {
        ["@id"] = "users/1"
    }
    ["Name"] = "Oren"
};


Console.WriteLine(doc["Name"]);

The final statement fails, because doc is actually equal to a string value of "Oren", not the dictionary object. Why? And--for me--I couldn't understand why it was compiling at all.

It took me a while to see the syntax parsing, and remember that C# can be very forgiving of spaces. I'm not a beginner, but it's sort of a beginner mistake. For other readers like me who aren't as bright (if you have any!), here's another look.

While it appeared to me an element was being set completely outside of the declaration, the element is being set on the declared dictionary.

The final syntax is:

var item = NewDictionary[element] = value;

But that syntax can look like this:

var item = NewDictionary              [element] = value;

Code samples

//spaces before the element don't matter
var item = new Dictionary<string, object> {}    ["a"] = 1;
//is the same as
var item = new Dictionary<string, object> { }["a"] = 1;

//For clarity, we replace the declaration with its assignment.
var dic = new Dictionary<string, object> {};
var item = dic     ["a"] = 1;
//is the same as
var item = dic["a"] = 1;

I looked for other examples of where this can be done. The simplest array case doesn't compile:

//This is OK.
int[] items = new int[1] {1};
int item = items [0] = 2;

//But this isn't.
int[] items = new int[1] { 1 } [0] = 2;

Hashtable works the same as Dictionary, which is logical.

//OK
var h = new Hashtable();
h [0] = 1;

//OK, too.
var h = new Hashtable() [0] = 1;

Thanks, Ayende, for getting my brain working!