Regular Expressions in C# (1)

  Jeffrey E. F. Friedl wrote a book on regular expressions called "Mastering Regular Expressions". To help readers understand and grasp regular expressions, the author makes up a story. The book is written mainly around the Perl language. As far as I know, regular expressions in C# are also based on Perl 5, so the two should have a lot in common.

  I do not intend to translate the content of the book wholesale. First, the book covers too much, and I am simply not competent to translate it. Second, if I really translated the book and at the same time changed its code to C# without the original author's permission, I might be suspected of infringement. So treat this as study notes.

  Skipping the lengthy preface, we go straight to Chapter 1:

  Introduction to regular expressions

  The author says this chapter is written for absolute beginners, and that its purpose is to lay a solid foundation for the later chapters. So if you are not a beginner, you can skip this chapter.

  The scenario:

  The head of your document department wants a tool that checks for doubled words (such as "this this"), a problem that commonly turns up when documents are edited heavily. Your job is to create a solution that can:

  Accept any number of files to check, report each line of each file that contains doubled words, highlight those doubled words, and make sure the name of the source file appears with each reported line.

  Work across lines, finding doubled words even when a word at the end of one line is repeated at the beginning of the next.

  Find doubled words regardless of differences in case (such as "The the"), and allow any amount of whitespace (spaces, tabs, newlines, and so on) between the words.

  Find doubled words even when they are separated by HTML tags (for example: "… it is <B>very</B> very important.").

  To solve this practical problem, the first thing to do is write a regular expression that finds the text we want and ignores the text we do not want, and then use our C# code to process the text it finds.

  Before using regular expressions, you may already have some idea of what a regular expression is. Even if you do not, you are almost certainly already familiar with the basic concept.

  You know that report.txt is a specific file name, but if you have any Unix or DOS/Windows experience, you also know that "*.txt" can be used to select multiple files. In a file-name pattern like this, some characters have a special meaning. An asterisk matches anything, and a question mark matches any one character. For example, "*.txt" selects every file whose name ends in .txt.

  File-name patterns use only a limited set of matching characters. Web search engines likewise allow a few special characters for matching content. Regular expressions, by contrast, offer a rich set of matching characters and can handle all kinds of complicated problems.

  First, we introduce two characters that match a position:

  ^: the position at the beginning of a line of text

  $: the position at the end of a line of text

  For example, the expression "^Cat" matches the word Cat only when it appears at the beginning of a line. Note that ^ matches a position; it does not match any character itself.

  Similarly, the expression "Cat$" matches the word Cat only when it appears at the end of a line.
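
  The two anchors can be tried out with a minimal sketch like the one below. It assumes the .NET System.Text.RegularExpressions.Regex class; the class name AnchorDemo and its helper methods are made up for illustration. One point to keep in mind: by default .NET treats the whole input as a single line, and RegexOptions.Multiline is what makes ^ and $ apply to every line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // True when the input starts with "Cat".
    public static bool StartsWithCat(string input) => Regex.IsMatch(input, "^Cat");

    // True when the input ends with "Cat".
    public static bool EndsWithCat(string input) => Regex.IsMatch(input, "Cat$");

    // With RegexOptions.Multiline, ^ matches at the start of every line,
    // not only at the start of the whole string.
    public static int LinesStartingWithCat(string text) =>
        Regex.Matches(text, "^Cat", RegexOptions.Multiline).Count;

    public static void Main()
    {
        Console.WriteLine(StartsWithCat("Cat food"));                      // True
        Console.WriteLine(StartsWithCat("My Cat"));                        // False
        Console.WriteLine(EndsWithCat("My Cat"));                          // True
        Console.WriteLine(LinesStartingWithCat("Cat one\nDog\nCat two"));  // 2
    }
}
```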

  Next, we introduce square brackets, "[]", which match any one of the characters listed inside them. For example:

  The expression "[0123456789]" matches any one of the digits 0-9.

  For example, to find all text that contains gray or grey, we can write the expression "gr[ea]y".

  "[ea]" matches either e or a, not the whole sequence ea.

  If we want to match the HTML tags <H1> <H2> <H3> <H4> <H5> <H6>, we can write the expression:

  "<H[123456]>". But what if we want to match any one of all the letters? Ha, here comes the problem: do we have to write every character inside the square brackets? Luckily we do not; we introduce the range symbol "-".

  With the range symbol we only need to give the boundary characters of a range. The HTML example above can be rewritten as "<H[1-6]>".

  The expression "[0-9a-zA-Z]": is its meaning clear now? It matches any one digit, any one of the 26 lowercase letters, or any one of the 26 uppercase letters.
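
  The character classes above can be checked with a short sketch such as the following. It assumes the .NET Regex class; CharClassDemo and its method names are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // [ea] matches exactly one character: either e or a.
    public static bool IsGrayOrGrey(string s) => Regex.IsMatch(s, "gr[ea]y");

    // [1-6] is a range: any one digit from 1 through 6.
    public static bool HasHeadingTag(string s) => Regex.IsMatch(s, "<H[1-6]>");

    // Several ranges can share one pair of brackets.
    public static bool HasLetterOrDigit(string s) => Regex.IsMatch(s, "[0-9a-zA-Z]");

    public static void Main()
    {
        Console.WriteLine(IsGrayOrGrey("grey"));         // True
        Console.WriteLine(IsGrayOrGrey("greay"));        // False
        Console.WriteLine(HasHeadingTag("<H3>Title"));   // True
        Console.WriteLine(HasHeadingTag("<H7>Title"));   // False
        Console.WriteLine(HasLetterOrDigit("--- ---"));  // False
    }
}
```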

  The "^" symbol inside []

  If you see an expression such as "[^0-9]", the "^" is no longer the position symbol introduced earlier. Here it is a negation symbol, meaning exclusion: the expression above matches any one character that is not a digit 0-9.
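
  A small sketch of the negated class, again assuming the .NET Regex class (NegatedClassDemo and its methods are illustrative names only). It also shows that a negated class still has to match a character: it means "one character that is not in the list", not "no character".

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedClassDemo
{
    // True when the input contains at least one character that is not 0-9.
    public static bool HasNonDigit(string s) => Regex.IsMatch(s, "[^0-9]");

    // Returns the first run of non-digit characters, or "" when there is none.
    public static string FirstNonDigitRun(string s) => Regex.Match(s, "[^0-9]+").Value;

    public static void Main()
    {
        Console.WriteLine(HasNonDigit("12345"));  // False
        Console.WriteLine(HasNonDigit("123a5"));  // True
        Console.WriteLine(HasNonDigit(""));       // False: there is no character to match
        Console.WriteLine("[" + FirstNonDigitRun("call 911") + "]");  // [call ]
    }
}
```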

  Exercise 1: given the expression "q[^u]", which of the following words will be matched?

  Iraqi

  Iraqian

  Miqra 

  Qasida 

  Qintar 

  Qoph 

  Zaqqum 

  Besides character ranges, there is also the dot character ".". A dot in an expression matches any single character.

  For example, the expression "07.04.76" matches:

  the forms 07/04/76, 07-04-76, and 07.04.76.
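
  Because the dot matches any character, the date expression above also accepts separators nobody intended. The sketch below compares it with a stricter variant, "07[-./]04[-./]76", which is an alternative added here for contrast and is not taken from the text above. DotDemo and its methods are hypothetical names; the .NET Regex class is assumed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotDemo
{
    // The dot accepts any single character between the numbers.
    public static bool LooseDate(string s) => Regex.IsMatch(s, "07.04.76");

    // A character class accepts only the listed separators.
    // Inside [], the dot is an ordinary character and a leading - is literal.
    public static bool StrictDate(string s) => Regex.IsMatch(s, "07[-./]04[-./]76");

    public static void Main()
    {
        string[] samples = { "07/04/76", "07-04-76", "07.04.76", "07x04x76" };
        foreach (string sample in samples)
        {
            // The last sample prints: 07x04x76: loose=True, strict=False
            Console.WriteLine($"{sample}: loose={LooseDate(sample)}, strict={StrictDate(sample)}");
        }
    }
}
```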

  If we need to choose among several alternatives, we can use the alternation character "|":

  The alternation character means "or". For example, the expression "Bob|Robert" matches either Bob or Robert.

  Returning to the expression "gr[ea]y" mentioned above, we can write it with alternation as "grey|gray"; the two are equivalent.

  Using parentheses: parentheses are also metacharacters in an expression. The previous expression can be written "gr(e|a)y". The parentheses are essential here: without them, the pattern "gre|ay" matches gre or ay, which is not the result we want. If this is not clear, consider the following example:

  Suppose we want to find all the lines in an e-mail that begin with From:, Subject:, or Date:. Compare the following two expressions:

  Expression 1: "^From|Subject|Date:"

  Expression 2: "^(From|Subject|Date):"

  Which one do we want?

  Obviously, expression 1 does not give the result we want: it matches From at the beginning of a line, or Subject anywhere, or Date: anywhere. Expression 2 uses parentheses and meets our needs.
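
  The difference between the two expressions shows up as soon as a header name appears in the middle of a line. The following is a minimal sketch assuming the .NET Regex class; HeaderDemo, its constants, and the sample lines are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeaderDemo
{
    // Without parentheses, ^ applies only to From and the colon only to Date.
    public const string Loose = "^From|Subject|Date:";

    // Parentheses limit the alternation, so ^ and : apply to all three names.
    public const string Grouped = "^(From|Subject|Date):";

    public static bool IsHeaderLoose(string line) => Regex.IsMatch(line, Loose);
    public static bool IsHeaderGrouped(string line) => Regex.IsMatch(line, Grouped);

    public static void Main()
    {
        string[] lines =
        {
            "From: someone@example.com",
            "Subject: regular expressions",
            "See the Subject line above"
        };
        foreach (string line in lines)
        {
            // The last line prints loose=True, grouped=False.
            Console.WriteLine($"{line} -> loose={IsHeaderLoose(line)}, grouped={IsHeaderGrouped(line)}");
        }
    }
}
```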

  Word boundaries

  We have matched at the beginning and at the end of a line, but what if we want positions other than the start or end of a line? We need the word boundary symbol, "\b". The backslash cannot be omitted; otherwise the pattern matches the letter b. With word boundaries we can require that a match begin at the start of a word or stop at the end of one, rather than in the middle of a word. For example, in the string "This is a cat." the expression "\bis\b" matches the word "is" but not the "is" inside "This".
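
  The word-boundary example can be verified with a sketch like this one, assuming the .NET Regex class (WordBoundaryDemo and its methods are hypothetical names). The @ prefix makes the C# string verbatim so the backslash reaches the regex engine unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Counts "is" only where it stands alone as a word.
    public static int CountWholeWord(string text) => Regex.Matches(text, @"\bis\b").Count;

    // Counts every occurrence of the letters "is", even inside other words.
    public static int CountAnywhere(string text) => Regex.Matches(text, "is").Count;

    public static void Main()
    {
        string text = "This is a cat.";
        Console.WriteLine(CountAnywhere(text));                // 2
        Console.WriteLine(CountWholeWord(text));               // 1
        Console.WriteLine(Regex.Match(text, @"\bis\b").Index); // 5
    }
}
```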

  String boundaries

  Besides the position symbols above, if we want to match an entire string (which may contain any number of words), we can use the following two symbols:

  \A: the beginning of the string;

  \z: the end of the string.

  The expression "\AThis is a cat\z" matches the string "This is a cat", and only when that is the entire string. It does not match "This is a cat." because the period sits between "cat" and the end of the string.

  An important concept behind the boundary symbols is the word character: a character that can be part of a word, that is, any one of [a-zA-Z0-9] (in .NET the underscore and other Unicode letters and digits count as word characters too). A period is not a word character, so a word-boundary expression such as "\bcat\b" does match in the sentence "This is a cat.", and the match does not include the period.
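
  The behaviour of the string boundaries can be checked with the following sketch. It assumes the .NET Regex class; StringAnchorDemo and its methods are names made up for illustration. Note that \z demands the very end of the string, so a trailing period is enough to prevent the match, while \b only needs a change from word character to non-word character.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StringAnchorDemo
{
    // True only when the whole input is exactly "This is a cat".
    public static bool IsWholeSentence(string s) => Regex.IsMatch(s, @"\AThis is a cat\z");

    // Returns "cat" when it appears as a separate word, otherwise "".
    public static string FindCat(string s) => Regex.Match(s, @"\bcat\b").Value;

    public static void Main()
    {
        Console.WriteLine(IsWholeSentence("This is a cat"));        // True
        Console.WriteLine(IsWholeSentence("This is a cat."));       // False
        Console.WriteLine(IsWholeSentence("Look! This is a cat"));  // False
        Console.WriteLine(FindCat("This is a cat."));               // cat
    }
}
```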

  Repetition symbols

  Consider the expression "Colou?r". It contains a question mark, which we have not seen in an expression before (its meaning here differs from the question mark in file-name matching). It is a repetition symbol: it says how many times the character in front of it may occur. "?" means 0 or 1 times. Here the u before the question mark may occur 0 or 1 times, so the expression matches "Color" or "Colour".

  The other repetition symbols are:

  +: one or more times

  *: zero or more times

  For example, to express one or more spaces, we can write the expression " +";

  What about a specific number of times? For that we introduce braces, {}.

  {n}: n is a specific number; repeat exactly n times.

  {n,m}: repeat at least n times and at most m times.

  All of these symbols limit how many times the single character in front of them is matched. But what if you want to repeat several characters, such as a word? Once again we use parentheses. Earlier we used parentheses to limit the scope of alternation; here is another use: they form a group. In the expression "(this)", this is a group. The problem is then easy to handle: a repetition symbol placed after a group says how many times the whole group repeats.
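
  The repetition symbols can be exercised with a sketch like the one below. It assumes the .NET Regex class; QuantifierDemo, its method names, and the digit and "(this ){2}" patterns are illustrative choices, not patterns from the book.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // u? allows zero or one u.
    public static bool IsColor(string s) => Regex.IsMatch(s, "Colou?r");

    // " +" matches a run of one or more spaces; each run is one match.
    public static int CountSpaceRuns(string s) => Regex.Matches(s, " +").Count;

    // {2,3} allows two or three digits; ^ and $ forbid anything else.
    public static bool IsTwoOrThreeDigits(string s) => Regex.IsMatch(s, "^[0-9]{2,3}$");

    // {2} after a group repeats the whole group, here "this " twice.
    public static bool HasDoubledThis(string s) => Regex.IsMatch(s, "(this ){2}");

    public static void Main()
    {
        Console.WriteLine(IsColor("Colour"));              // True
        Console.WriteLine(IsColor("Colouur"));             // False
        Console.WriteLine(CountSpaceRuns("a  b c"));       // 2
        Console.WriteLine(IsTwoOrThreeDigits("123"));      // True
        Console.WriteLine(IsTwoOrThreeDigits("1234"));     // False
        Console.WriteLine(HasDoubledThis("this this one")); // True
    }
}
```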

  Now back to the doubled-word problem. If we want to find "the the", then with what we have learned so far we can write the expression:

  "\bthe +the\b"

  The expression matches two occurrences of the word the separated by one or more spaces.

  Similarly, we can also write:

  "\b(the +){2}"

  But what if we want to find every possible doubled word? Our current knowledge is not enough to solve this, so we introduce the concept of the backreference. We have seen that parentheses can delimit a group. An expression may contain several groups, and each is assigned a number by default according to the order in which it appears: the first group is number 1, and so on. Later in the expression, "\n" refers back to a group, where n is the number of the group being referenced. A backreference works like a variable in a program. Consider the following concrete examples:

  The doubled-word expression above can now be written with a backreference:

  "\b(the) +\1\b"

  Now, if we want to match all doubled words, we can rewrite the expression:
  "\b([a-zA-Z]+) +\1\b"

  The last question: what if the character we want to match is itself a regular-expression symbol? Right, use the escape symbol "\". For example, to match a decimal point, write "\.". Also note that when the expression is used in a program, the "\" must follow the rules for string literals: write it as "\\", or put @ in front of the string.
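
  The backreference and the string-literal rule can be tried together in a sketch such as the following. It assumes the .NET Regex class and assumes the doubled-word pattern is meant to read \b([a-zA-Z]+) +\1\b; DoubledWordDemo and its members are hypothetical names. RegexOptions.IgnoreCase is included because the scenario asks for doubled words that differ in case.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class DoubledWordDemo
{
    // The same pattern written twice: a regular literal needs "\\",
    // a verbatim literal (prefixed with @) keeps "\" as it is.
    public const string Escaped = "\\b([a-zA-Z]+) +\\1\\b";
    public const string Verbatim = @"\b([a-zA-Z]+) +\1\b";

    // Returns every doubled word found in the text, for example "the the".
    public static List<string> FindDoubledWords(string text, bool ignoreCase = false)
    {
        RegexOptions options = ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None;
        return Regex.Matches(text, Verbatim, options)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }

    // True when the input contains a literal decimal point.
    public static bool HasDecimalPoint(string s) => Regex.IsMatch(s, @"\.");

    public static void Main()
    {
        Console.WriteLine(Escaped == Verbatim);                                  // True
        Console.WriteLine(string.Join(", ",
            FindDoubledWords("This is is a test of the the pattern")));          // is is, the the
        Console.WriteLine(FindDoubledWords("the theory").Count);                 // 0
        Console.WriteLine(FindDoubledWords("The the").Count);                    // 0
        Console.WriteLine(FindDoubledWords("The the", ignoreCase: true).Count);  // 1
        Console.WriteLine(HasDecimalPoint("3.14"));                              // True
        Console.WriteLine(HasDecimalPoint("314"));                               // False
    }
}
```

  Writing "\b" in a regular C# literal without doubling the backslash would produce a backspace character rather than a word boundary, which is one reason the verbatim form is usually easier to read for patterns.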

  This chapter only gives beginners the basics of regular expressions. It is only one part; there is still much to learn, which will be introduced one by one in later sections. In fact, regular expressions are not hard to learn. What you need is patience and practice if you want to master them. Someone said: "I do not want to know the details of the car, I just want to learn how to drive." If you think that way too, you will never know how to use regular expressions to solve your own problems, and you will never know their real strengths.
