Tuesday, 12 April 2016

java - Regex - find various strings from an HTML file

I have an HTML file called basic.html, and my task is to write a small Java program that uses regular expressions to find various strings in it. The program should display the line number of every occurrence of each of the strings below:




  • div tag
  • div class="menuItem" tag
  • span tag
  • class="emph"
  • Any string beginning with < and ending with >, i.e. all tags
  • The contents of the body tag
  • The contents of all divs
  • All divs that make menus
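As a rough starting point, the items in the list could each be mapped to one candidate regex. The sketch below is only an illustration: the class name `TagPatterns`, the helper names `patterns()` and `count()`, the labels, and the sample HTML are all made up for this example, and basic.html itself is not shown. It assumes attribute values use straight double quotes, that `class` is the first attribute of a menu div, and that "all divs that make menus" means the `menuItem` divs. The non-greedy content patterns will not handle nested divs correctly.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagPatterns {
    // Candidate patterns keyed by a label; every regex here is an illustrative guess.
    static Map<String, Pattern> patterns() {
        Map<String, Pattern> p = new LinkedHashMap<>();
        p.put("div tag", Pattern.compile("<div\\b[^>]*>"));
        p.put("menuItem div tag", Pattern.compile("<div\\s+class=\"menuItem\"[^>]*>"));
        p.put("span tag", Pattern.compile("<span\\b[^>]*>"));
        p.put("class emph", Pattern.compile("class=\"emph\""));
        p.put("any tag", Pattern.compile("<[^>]+>"));
        // DOTALL lets '.' cross line breaks, since tag contents may span several lines.
        p.put("body contents", Pattern.compile("<body\\b[^>]*>(.*?)</body>", Pattern.DOTALL));
        p.put("div contents", Pattern.compile("<div\\b[^>]*>(.*?)</div>", Pattern.DOTALL));
        return p;
    }

    // Counts how many times the pattern occurs in the input.
    static int count(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for basic.html.
        String html = "<html><body>\n"
                + "<div class=\"menuItem\">Home</div>\n"
                + "<div><span class=\"emph\">Hi</span></div>\n"
                + "</body></html>";
        patterns().forEach((label, pattern) ->
                System.out.println(label + ": " + count(pattern, html)));
    }
}
```

For the two "contents" patterns, `matcher.group(1)` would give the text between the opening and closing tags, while `matcher.group()` gives the whole match including the tags.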



I must also use the start() and end() methods of Matcher to display index values.



I have started my code as follows:



import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexHTML {
    public static void main(String[] args) throws IOException {

        // Input for matching the regex pattern
        String file_name = "basic.html";

        // ReadFile is a helper class that is not shown here;
        // OpenFile() returns the lines of the file
        ReadFile file = new ReadFile(file_name);
        String[] aryLines = file.OpenFile();

        // Join the lines with newlines. Arrays.toString() would add "[", ", " and "]"
        // to the text, which shifts every index reported by start() and end().
        String asString = String.join("\n", aryLines);

        // Regex to be matched (the literal was lost from the post;
        // "<div>" is assumed because the code is described as finding the div tag)
        String regexe = "<div>";

        // Print the file contents
        for (int i = 0; i < aryLines.length; i++) {
            System.out.println(aryLines[i]);
        }

        // Step 1: Allocate a Pattern object to compile a regex
        Pattern pattern = Pattern.compile(regexe);
        // Pattern pattern = Pattern.compile(regexe, Pattern.CASE_INSENSITIVE); // case-insensitive matching

        // Step 2: Allocate a Matcher object from the compiled regex pattern,
        // and provide the input to the Matcher
        Matcher matcher = pattern.matcher(asString);

        // Step 3: Perform the matching and process the matching result
        int count = 0;
        // Use method find()
        while (matcher.find()) { // find the next match
            System.out.println("find() found the pattern \"" + matcher.group()
                    + "\" starting at index " + matcher.start()
                    + " and ending at index " + matcher.end());
            count++;
        }

        System.out.println("\nFound the pattern " + count + " times.\n");

        // Use method matches(): succeeds only if the whole input matches
        if (matcher.matches()) {
            System.out.println("matches() found the pattern \"" + matcher.group()
                    + "\" starting at index " + matcher.start()
                    + " and ending at index " + matcher.end());
        } else {
            System.out.println("matches() found nothing");
        }

        // Use method lookingAt(): succeeds only if the input starts with a match
        if (matcher.lookingAt()) {
            System.out.println("lookingAt() found the pattern \"" + matcher.group()
                    + "\" starting at index " + matcher.start()
                    + " and ending at index " + matcher.end());
        } else {
            System.out.println("lookingAt() found nothing");
        }
    }
}


My biggest problem is how to display all of those occurrences. My code so far only gives me the index values for the div tag, but I would like the output to cover every item in the list above.
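One possible way to cover several strings in a single run is to loop over the regexes and create a fresh Matcher for each one. The following is a minimal sketch under that assumption; the class name `MultiPatternScan`, the method `scan`, the report format, and the sample input are hypothetical and not part of the original program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiPatternScan {
    // Collects one report line per match of each regex, in the order the regexes are given.
    static List<String> scan(String input, String... regexes) {
        List<String> report = new ArrayList<>();
        for (String regex : regexes) {
            // A fresh Matcher per pattern, all reading the same input text.
            Matcher m = Pattern.compile(regex).matcher(input);
            while (m.find()) {
                report.add(regex + " -> \"" + m.group() + "\" ["
                        + m.start() + ", " + m.end() + ")");
            }
        }
        return report;
    }

    public static void main(String[] args) {
        String html = "<div><span>x</span></div>"; // made-up sample input
        scan(html, "<div>", "<span>").forEach(System.out::println);
        // prints:
        // <div> -> "<div>" [0, 5)
        // <span> -> "<span>" [5, 11)
    }
}
```

Note that `end()` returns the index just past the last matched character, which is why the ranges are written as half-open intervals.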
My second problem is how to display the line on which each string occurs. I haven't really researched this yet, as I'm concentrating on the first question at the moment. However, if you could give me a hint as to where to get started on this one too, I would appreciate it.
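For the line numbers, one option is to derive the line from the match offset by counting the newline characters that precede `start()`. The sketch below assumes the lines were joined with '\n' into a single string. The names `LineNumbers` and `linesOf` and the sample lines are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineNumbers {
    // Returns the 1-based line number of every match, assuming lines are separated by '\n'.
    static List<Integer> linesOf(Pattern pattern, String text) {
        List<Integer> lines = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        int line = 1;
        int scanned = 0; // offset up to which newlines have already been counted
        while (m.find()) {
            // Matches come back in increasing order, so only the new stretch is scanned.
            for (int i = scanned; i < m.start(); i++) {
                if (text.charAt(i) == '\n') {
                    line++;
                }
            }
            scanned = m.start();
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up lines standing in for the contents of basic.html.
        String[] aryLines = { "<html>", "<div>a</div>", "<p>b</p>", "<div>c</div>" };
        String text = String.join("\n", aryLines);
        System.out.println(linesOf(Pattern.compile("<div>"), text)); // prints [2, 4]
    }
}
```

A simpler alternative for patterns that never span a line break would be to run the Matcher on each element of the line array and report the array index plus one, but that approach cannot find matches such as the body contents that cross several lines.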
