RegEx in PN – Example

January 11, 2012 at 2:51 am #6812

NickDMax
Member

So I was thinking about this RegEx/Script example from the Untidy blog and I thought: “Can that be done in 1 RegEx?”

The answer is of course: Yes – but would you really want to?

Search: ^(?:(?!LogStep).*(?:\r\n)+|LogStep\(\s*"([^"]+)"\s*\).*(\r\n)(?:\r\n)*)

Replace: \1\2

Hit <Replace All>.

This is a pretty complex RegEx, so let's break it up. The overall structure is A|B

A: RegEx to match the rest of a line NOT BEGINNING with “LogStep”

B: RegEx to match the rest of a line that DOES begin with “LogStep” – capture the text (and a CRLF new line)

The whole thing is wrapped in the expression: ^(?:A|B)

where ^ asserts the beginning of a line, and (?: ) is a non-capture group – that is, it is treated as a group but not captured into \1 or \2 etc. This outer non-capture group creates separation. Without it our expression would need to be: ^A|^B

Since the expression A contains no capture groups BUT the overall expression DOES contain 2 capture groups, \1 and \2 will be empty when expression A matches. When B matches, \1 is the text inside of the quotation marks, and \2 is a CRLF (\r\n) pair.
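
The same search and replace can be tried outside PN with Python's re module. This is only a sketch: the sample text is made up, the pattern is the search above with its backslashes written out, and Python's engine is not the Xpressive engine PN uses, so edge cases may differ. It assumes Python 3.5 or later, where an unmatched group in the replacement is treated as empty.

```python
import re

# The pattern in the form ^(?:A|B). MULTILINE makes ^ match at every line start.
PATTERN = re.compile(
    r'^(?:(?!LogStep).*(?:\r\n)+'                      # A: line not starting with LogStep
    r'|LogStep\(\s*"([^"]+)"\s*\).*(\r\n)(?:\r\n)*)',  # B: LogStep("text") line
    re.MULTILINE,
)

# Hypothetical sample input with Windows (CRLF) line endings.
SAMPLE = (
    'int x = 1;\r\n'
    'LogStep("Open file");\r\n'
    '\r\n'
    'x++;\r\n'
    'LogStep( "Close file" );\r\n'
)


def extract_log_steps(text):
    """Keep only the quoted LogStep text, one entry per line."""
    # \1 and \2 are empty when A matches, so those lines are deleted.
    return PATTERN.sub(r'\1\2', text)


if __name__ == '__main__':
    print(repr(extract_log_steps(SAMPLE)))
```

With this sample, every line that does not start with LogStep disappears and each LogStep line is reduced to its quoted text followed by one CRLF.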

January 11, 2012 at 3:05 pm #18369

NickDMax
Member

So how might this be useful? Well, people often use TODO: tags in comments to identify tasks that need to be completed. The following PyPN script will extract all TODO: EOL comments (on their own line) into a new file.

###############################################################################
## Extract all ToDo lines into a new file
## By: NickDMax

import pn
import scintilla
from pypn.decorators import script

@script("Extract Todo","CPP Utils")
def Name():
    """ Extract TODO: lines into a new file """
    #recorded search/replace wrapped in an UndoAction
    doc = pn.CurrentDoc()
    sci = scintilla.Scintilla(doc)
    sci.BeginUndoAction()
    opt = pn.GetUserSearchOptions()
    #backslashes are doubled so the regex engine receives \t, \r, \n and \1\2
    opt.FindText = u'^(?:(?![ \\t]*//[ \\t]*(?i:TODO:)).*(?:\\r\\n)+|[ \\t]*//[ \\t]*(?i:TODO:)[ \\t]*(.*)(\\r\\n)?(?:\\r\\n)*)'
    opt.MatchWholeWord = False
    opt.MatchCase = False
    opt.UseRegExp = True
    opt.SearchBackwards = False
    opt.LoopOK = True
    opt.UseSlashes = False
    opt.ReplaceText = u'\\1\\2'
    opt.ReplaceInSelection = False
    doc.ReplaceAll(opt)
    sci.EndUndoAction()
    #Now to save the text in a new file
    startPos = 0
    endPos = sci.Length
    text = sci.GetTextRange(startPos, endPos)
    newDoc = pn.NewDocument("")
    newsci = scintilla.Scintilla(newDoc)
    newsci.BeginUndoAction()
    newsci.AppendText(len(text), text)
    newsci.EndUndoAction()
    #restore the document (undo the replace-all)
    sci.Undo()

Of course I have only tested this with one test file:

#include<iostream>

//TODO: remember to do something here.

int main() {
    //TODO: remember to do something else here
    // TODO: oh don't forget to do this

    return 0;
}

// ToDo: and then there is this here too.

So a little more testing is probably needed to ensure that it is working as expected.
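
One cheap way to do some of that testing is to run the same pattern over the test file in plain Python, outside PN. This is a sketch under assumptions: it uses Python's re module rather than PN's Xpressive engine, it assumes CRLF line endings, and the scoped (?i:...) group needs Python 3.6 or later. The names TODO_RE, TEST_FILE and extract_todos are made up for the example.

```python
import re

# FindText from the script, as the regex engine sees it.
TODO_RE = re.compile(
    r'^(?:(?![ \t]*//[ \t]*(?i:TODO:)).*(?:\r\n)+'
    r'|[ \t]*//[ \t]*(?i:TODO:)[ \t]*(.*)(\r\n)?(?:\r\n)*)',
    re.MULTILINE,
)

# The test file from the post, joined with CRLF line endings.
TEST_FILE = '\r\n'.join([
    '#include<iostream>',
    '',
    '//TODO: remember to do something here.',
    '',
    'int main() {',
    '    //TODO: remember to do something else here',
    "    // TODO: oh don't forget to do this",
    '',
    '    return 0;',
    '}',
    '',
    '// ToDo: and then there is this here too.',
    '',
])


def extract_todos(text):
    """Return the text of every TODO: comment that sits on its own line."""
    return TODO_RE.sub(r'\1\2', text).splitlines()


if __name__ == '__main__':
    for todo in extract_todos(TEST_FILE):
        print(todo)
```

In Python at least, the greedy (.*) also swallows the \r at the end of a TODO line, so the second group stays empty and the lone \n is simply left behind. The extracted lines come out the same, but it shows why the line-ending handling deserves a few more test files.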

January 11, 2012 at 11:37 pm #18370

NickDMax
Member

As long as we are on the topic of RegEx: there is a defect logged about the behavior of the dot (.) operator. As I understand it, the dot operator should not match EOL chars. The Xpressive search is given the option regex_constants::not_dot_newline, which keeps the dot from matching \n. However, Windows tends to use \r\n as a newline marker, and the dot will match the \r.

So when it comes to the end of the line, the dot operator can be a little funny, and often it is best just not to use it!

The dot operator operates as: [^\n] – matches everything BUT the new line char.

What Windows users typically expect: [^\r\n] – matches everything BUT the Windows new line chars.

How to match any char: [\s\S] – will match any char.
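
Python's re module happens to treat the dot the same way by default (everything except \n), so the three character classes can be compared there. This is an illustration only, not PN's engine; the sample string and the helper name leading_match are made up.

```python
import re

# Two lines separated by a Windows (CRLF) line ending.
LINE = 'first line\r\nsecond line'


def leading_match(pattern, text=LINE):
    """Return what `pattern` matches at the start of `text`."""
    return re.match(pattern, text).group()


dot_match = leading_match(r'.*')             # stops at \n but swallows the \r
no_lf_match = leading_match(r'[^\n]*')       # behaves exactly like the dot
no_crlf_match = leading_match(r'[^\r\n]*')   # stops before the \r\n pair
any_char_match = leading_match(r'[\s\S]*')   # runs straight across the line break

if __name__ == '__main__':
    for name in ('dot_match', 'no_lf_match', 'no_crlf_match', 'any_char_match'):
        print(name, repr(globals()[name]))
```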

Example: Say you want to replace the innermost div tags with span tags.

Find What: <div>(?![\s\S]*?<div>[\s\S]*)([\s\S]*?)</div>

Replace with: <span>\1</span>

The basic version of this regex is: <div>([\s\S]*?)</div>

This captures all chars inside of a div tag. However, if the tags are nested then problems happen, since this will match the first open div tag with the first close div tag. So we add an assertion that the match does not include a div tag itself. The result is that we will look for the innermost div tag (assuming of course valid HTML).

The point of the example is that we can use [\s\S] to capture all chars in the tags (including new line chars). \s is the class of all white-space characters and \S is the set of all NON-white-space characters. One could equally use [\w\W] or [\d\D]
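
Here is a small sketch of the div example in Python's re module, comparing the basic version with the one that carries the assertion. The HTML snippets are invented, and TEMPERED is not from the post: it is one possible variant that checks for a div tag at each character instead of scanning the rest of the text.

```python
import re

BASIC = re.compile(r'<div>([\s\S]*?)</div>')
INNERMOST = re.compile(r'<div>(?![\s\S]*?<div>[\s\S]*)([\s\S]*?)</div>')
# Assumption, not from the post: refuse to step over any <div> or </div>.
TEMPERED = re.compile(r'<div>((?:(?!</?div>)[\s\S])*)</div>')

# Hypothetical inputs: one nested pair and two sibling divs.
NESTED = '<div>outer\n<div>inner</div>\n</div>'
SIBLINGS = '<div>a</div><div>b</div>'


def to_span(regex, html):
    """Replace every match of `regex` with a span around its first group."""
    return regex.sub(r'<span>\1</span>', html)


if __name__ == '__main__':
    for regex in (BASIC, INNERMOST, TEMPERED):
        print(repr(to_span(regex, NESTED)), repr(to_span(regex, SIBLINGS)))
```

On the nested input the basic version pairs the outer open tag with the inner close tag, while the asserted version replaces only the inner div. One thing to watch, at least in Python: the assertion looks ahead through the whole remaining text, so with sibling divs only the last one is replaced. The per-character variant handles both inputs.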

Sometimes, however, it is nice to be able to limit our searches to a single line. Using [\s\S] can get us into trouble here. Even something like //([\s\S]*)$, an attempt to grab the content of an end-of-line comment, will match past the end of the line. We would need //([\s\S]*?)$ to ensure the non-greedy match.

Or we could just be more specific about what we are looking for: //([^\r\n]*)
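
The three comment patterns can be compared side by side in Python. Treat this as a sketch: the source text is made up, Python needs the MULTILINE flag before $ will match at a line end, and PN's handling of $ may not be identical.

```python
import re

# Hypothetical two-line source with CRLF line endings.
SOURCE = 'x = 1; // note\r\ny = 2;\r\n'

# Greedy: runs on to the very end of the text.
greedy = re.search(r'//([\s\S]*)$', SOURCE, re.MULTILINE).group(1)

# Non-greedy: stops at the first line end, but $ sits before the \n,
# so the \r is still captured.
lazy = re.search(r'//([\s\S]*?)$', SOURCE, re.MULTILINE).group(1)

# Specific character class: stops before either line-ending char.
specific = re.search(r'//([^\r\n]*)', SOURCE).group(1)

if __name__ == '__main__':
    print(repr(greedy), repr(lazy), repr(specific))
```

Only the last pattern returns the comment text with no line-ending chars attached, which is a good argument for being specific.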

So it is true that the dot operator is a little buggy in PN (when operating on files using \r\n or \r line endings), but a little knowledge can get you around it and let you use the full power of PN’s regular expression search and replace.

January 12, 2012 at 10:20 am #18371

simon
Key Master

Excellent stuff! This would be great content for the docs or blog – perhaps both.

It’s definitely worth looking into Xpressive’s newline matching behaviour. I agree that generally we don’t want to match newlines, although for some multi-line matching it can be useful.

January 12, 2012 at 3:30 pm #18372

NickDMax
Member

Well, I think that the dot operator should be [^\r\n] because, as the defect points out, the current behavior is not consistent. However, I love that PN does actually support multi-line regex – i.e. that [\s\S] WILL match across new-line chars. Though I suppose that it would be nice to have a “single-line/multi-line” option that would either look at the document on a line-by-line basis or (as it is currently) as a single stream.

Since this is one area of PN that I use extensively I too will be looking into improving PN’s regex support. For my personal use I have already added the format_perl features from Xpressive.

I also spectacularly failed in my attempt to add Unicode support for replace! However, I am still looking into how you managed to get Unicode search working and I am confident that I can sort things out. My fail-safe plan is to parse the replace string and “encode” the Unicode chars into \x00\x00 etc. so that the replace string can remain a std::string. However, I know next to nothing about the non-ASCII world so I don’t know how successful I will be. I will probably pester you with questions. :)
