Practice Problem Link: Substring Search - III
Please make sure to try solving the problem yourself before looking at the editorial.
Problem Statement
Given two strings s1 and s2, find the index of the first occurrence of s2 in s1 as a substring.
If no such occurrence exists, return -1.
This problem is also known as finding a needle in a haystack.
Use the KMP algorithm to solve this problem.
Approach
In a naive string search algorithm, we need to check every substring of length s2 until a matching substring is found. But in KMP the approach is optimized by not checking the sub-patterns occurring multiple times. While comparing a substring, if a mismatch occurs we move to the next substring and since the previous and current substring have some common characters which has been already visited, the idea is to skip some of the characters from these visited characters that will match anyway.
To do this efficiently, the concept of the LPS is used. LPS of a string is the longest proper prefix which is also a suffix.
Example:
String: AABCDAA
Proper prefixes: A,AA,AAB,AABC,AABCD and AABCDA.
LPS of the above-given string is AA.
To solve this problem we will create an lps array of the string s2 where lps[i] denotes the length of the lps of substring s2[0…i].
Creating LPS Array (Examples)
String S: AACA
LPS array: [0, 1, 0, 1]
String S: BACDABACD
LPS array: [0, 0, 0, 0, 0, 1, 2, 3, 4]
After creating the LPS array all that needs to be done is searching the occurrence of the string s2 in s1 with the help of the LPS array in linear time.
- Start matching both the given strings
s1ands2with initial indexes asi = 0andj = 0respectively. - While
s1[i] == s2[j]simply incrementiandj. - If
jis equal to the length of the strings2then the first occurrence ofs2is found so return the starting index of this substring. - In case of a mismatch
(s1[i] != s2[j]), we will setj = lps[j - 1], if(j > 0)as we know that all the previous characters ofs2will match anyway.
Analysis
- Time Complexity:
O(n) - Auxiliary Space Complexity:
O(n)
Implementation
C++
int findStartIndexOfSubstring(string s1, string s2) {
vector<int> lps(s2.size(), 0);
int i = 1, j = 0;
while(i < s2.size()) {
if(s2[i] == s2[j]) {
j++;
lps[i++] = j;
} else {
if(j == 0) {
lps[i++] = 0;
} else {
j = lps[j - 1];
}
}
}
i = 0;
j = 0;
while(i < s1.size()) {
if(s1[i] == s2[j]) {
i++;
j++;
}
if(j == s2.size()) {
return i - j;
} else if (i < s1.size() && s1[i] != s2[j]) {
if (j == 0) {
i++;
} else {
j = lps[j - 1];
}
}
}
return - 1;
}Java
class Solution {
int findStartIndexOfSubstring(String s1, String s2) {
int[] lps = new int[s2.length()];
int i = 1, j = 0;
while(i < s2.length()) {
if(s2.charAt(i) == s2.charAt(j)) {
j++;
lps[i++] = j;
} else {
if(j == 0) {
lps[i++] = 0;
} else {
j = lps[j - 1];
}
}
}
i = 0;
j = 0;
while(i < s1.length()) {
if(s1.charAt(i) == s2.charAt(j)) {
i++;
j++;
}
if(j == s2.length()) {
return i - j;
} else if (i < s1.length() && s1.charAt(i) != s2.charAt(j)) {
if (j == 0) {
i++;
} else {
j = lps[j - 1];
}
}
}
return - 1;
}
}