Program 2 – Detecting Substrings (C++ Version)
Introduction
A very common task that is often performed by programs that work with text files is the problem of locating a specific substring within the file. I am sure we’ve all done this many times when working with Word, Notepad, or other editors.
Since we don’t have a GUI or other means of displaying the contents of a file all at once, let’s modify the problem slightly. Rather than locating a specific substring within a file and then highlighting the results, as most modern programs would do, let’s write a C++ program that locates the occurrences of a specific substring within a file and then displays the occurrence number as well as a portion of the text around the found substring. It should also count the number of occurrences and indicate that number in a brief report at the end.
CS1337 – Computer Science I page: 5
The input file that we will use for this exercise is a text copy of the Declaration of Independence. You will find this in the input file “DeclOfIndep.txt” on eLearning. It is okay to hard code this file name for this project, but in the future, you might have your program ask the user for the filename. That would generalize your program to work on any input file.
Even though this is a C++ program, let’s use C-strings in this exercise for all string operations. Thus, your program will read each line of input from the file into a C-string, not into a C++ string. The functions you use to locate the substrings should be C-string functions and not members of the string class. All printing and other manipulations on strings should be done with C-strings. In short, there should be no C++ strings used in this program. (Therefore, do not #include .)
Overview
Here’s a high-level overview of what your program should do:
1) Ask the user for the substring to search for (and store it in a C-string). 2) Calculate and display each of the found occurrences of the substring in the file. For each found occurrence, your program should display (a) the location number starting at 1 and going up to the total number of locations found, and (b) the portion of the string containing the found substring. This portion should consist of the substring itself and up to 8 characters before and 8 characters after the found substring. (Note that if the substring is within 8 characters of the end of a line this won’t be possible.)
Sample Runs
Here is a sample run looking for the string “people” in the file “DeclOfIndep.txt”.
Looking for the substring "people" in file "DeclOfIndep.txt":
Location 1: String: "for one people to" Location 2: String: "icts of people, unless" Location 3: String: "s those people would r" Location 4: String: " of the people." Location 5: String: " of our people."
There were 5 occurrences of the string "people" within the file "DeclOfIndep.txt".
Or consider another run looking for the substring “oo” in the file “DeclOfIndep.txt.”
Looking for the substring "oo" in file "DeclOfIndep.txt":
Location 1: String: "public goooood." Location 2: String: "ublic goooood."
CS1337 – Computer Science I page: 6
Location 3: String: "blic goooood." Location 4: String: "lic goooood." Location 5: String: "armed troops among" Location 6: String: ". They too have" Location 7: String: "of the good People" Location 8: String: "illiam Hooper" Location 9: String: "s Lightfoot Lee" Location 10: String: "Witherspoon"
There were 10 occurrences of the string "oo" within the file "DeclOfIndep.txt".
Programming Notes:
There are several points to be made about this problem in general, and about both sample runs.
1) Both of the sample runs given above are actual data, so you can use them to test your program.
2) Note that we deliberately modified a single word “good” in the Declaration file to “goooood”. In other words, we added some extra “o’s” to the word.
This makes the point that some substrings can overlap. For example, if we search for “oo” in the word “goooood,” the first “oo” will certainly be a hit. But the second hit occurs with the second “o” of the first hit, i.e., the two instances of “oo” overlap. Because of this overlap, there are really four hits of “oo” within that word, as illustrated above, not two, and your program should find all four of them.
3) Do not use “inFile >>” to read the data from the input file. As you know, “inFile >>” tokenizes around white space and would therefore extract each word from the input file separately. Of course, this has both advantages and disadvantages depending on the circumstances. It would, for example, be a good function to use if we wanted to process within individual words only. But since our program should be able to detect substrings consisting of more than one word, “inFile >>” will not serve our purposes.
Therefore, use “inFile.getline()” (i.e., the member function version of getline() – see Chapter 10, Slide 17+.) as your primary input function. As we discussed in class, this will read a single line of input from the file at a time and place it in the target buffer, which should be a character array of adequate size.
4) Note that the file will be processed one line at a time. It is not necessary to look for substrings that span more than one line.
5) Since we are using C-strings for this assignment, we’ll have to use the C string processing functions. A very useful function for this assignment would be the strstr(const char *, const char *) function, which locates an instance of
CS1337 – Computer Science I page: 7
the right hand string inside the left hand string and returns a pointer to the found instance. (It returns a NULL if no instance is found.) This function is described on slide 27 of the Chapter 10 slide set. The strchr(const char *, int ch), not listed in the slide set, locates the first occurrence of “ch” in the string and returns a pointer to it.
6) Since all C-strings are based on character arrays, be careful about running off either end of the array. Since you are required to print not only the substring but also 8 characters to either side of it, this overrun can occur if the substring you are looking for is within 8 characters of either the beginning or the end of that line. (Often an array overrun will be detected if your program starts printing out gibberish or default characters instead of text from the Declaration.)
For an example, consider the first sample run (i.e., looking for the word “people”). Note that in locations 4 and 5, the word “people” appears not only at the end of a sentence but also at the end of a line of input. It is, therefore, impossible to display 8 characters after the found substring in those cases, since we are processing on a line-by-line basis. This is perfectly okay. If there are not 8 characters either before or after the found substring, just terminate the output report at that point.
7) Build up your solution in a modular fashion, debugging as you go. Do not attempt to write the whole program at once. If you feel lost at some point, simplify your problem down to something manageable. You might, for example, create a sample input file with a single sentence in it and see if your program can detect substrings within it. In any case, unless it is necessary to solve a problem, you should never have more than one function at once under development. Debug that function before moving on to the next. If you will code in this way, your overall development time will be much quicker.
8) Be alert to array overruns on either side as you look for substrings.
Deliverables
Please submit your C++ source code file. There is no output file on this problem.