Scraping is the common name for extracting desired information from a file. I copied and pasted some auction domains from GoDaddy.com into a file, and I wanted to delete everything except the domain names. While there might be a single regular expression that does this, I couldn’t find exactly what I wanted, so I came up with this multi-step process instead. I am a programmer and could have written code for it, but sometimes you want to do something quick and dirty in a free tool like Notepad++. Notepad++ supports regular expressions, called “RegEx” for short.
Here’s an example of what I copied and pasted:
Now I wanted to get just the domain names and paste them into another tool that would give me the metrics of each web site.
Here’s the regular expression I came up with: ([-_\w]*?\.(com|net|org)). It matches a run of letters, digits, underscores, or hyphens followed by .com, .net, or .org. I can’t teach RegEx in this one blog post; there are many sites dedicated to that, such as RegExOne.com, which even lets you practice as you go through the lessons by actually running simple regular expressions.
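For readers who would rather script it, here is a minimal Python sketch of the same idea. It uses the pattern above unchanged. The sample text and the function name extract_domains are made up for illustration; they are not the actual GoDaddy listing.

```python
import re

# The pattern from this post, unchanged. Group 1 is the whole domain,
# group 2 is just the extension (com, net, or org).
DOMAIN_PATTERN = re.compile(r"([-_\w]*?\.(com|net|org))")


def extract_domains(text):
    """Return every substring of text that the pattern matches, in order."""
    # finditer is used instead of findall because findall would return
    # (domain, extension) tuples when the pattern has two groups.
    return [match.group(1) for match in DOMAIN_PATTERN.finditer(text)]


# Hypothetical sample lines, invented for this example.
sample = (
    "example-shop.com 5 bids $25\n"
    "my_site.net 0 bids $10\n"
    "coolname.org 2 bids $12\n"
)

for domain in extract_domains(sample):
    print(domain)
```

One caveat with this sketch: the pattern does not check what follows the extension, so a name like foo.company would be picked up as foo.com. For a quick and dirty cleanup that is usually acceptable, but it is worth eyeballing the results.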
The above makes use of regular expressions in Notepad++. The idea behind this type of expression dates to the 1950s, when the American mathematician Stephen Kleene formalized the description of a regular language. It first became popular in the Unix community through the command-line text-processing utilities ed, an editor, and grep, a filter. The Perl language probably helped to popularize it further.
While RegEx is not always easy to learn, it can be very powerful for making repetitive edits or extracting data from one file to another.