Hey there
I finally found some time to write something new, and I think this is going to be quite useful:
What we’re writing is a tokenizer for search strings like the ones used in Google, for example, but somewhat simplified.
Here is an example of a string passed to this function:
author:vinci -free content:/tokenizer/ -"not this string"
The output generated by our code would look like this:
array(
    array("type" => "string", "target" => "author", "not" => false, "value" => "vinci"),
    array("type" => "string", "target" => "content", "not" => true, "value" => "free"),
    array("type" => "regex", "target" => "content", "not" => false, "value" => "tokenizer"),
    array("type" => "string", "target" => "content", "not" => true, "value" => "not this string")
)
The output can be used directly in a foreach loop to filter rows from an SQL database or a PHP array containing the data.
Of course you can adapt the possible combinations yourself; the particular case I’m describing here is used for filtering RSS feeds (author, content, title, feed URL).
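As a rough illustration of the foreach filtering mentioned above, here is a minimal sketch that applies a token list to a plain PHP array. The helper name matchesTokens, the sample $tokens and $items, and the substring matching rule are assumptions made for this example; they only mirror the output format shown above.

```php
<?php
// Hypothetical helper: returns true if $item passes every token.
// Assumes each token has the keys type, target, not and value,
// and that $item holds one string per target name.
function matchesTokens(array $item, array $tokens): bool
{
    foreach ($tokens as $token) {
        $haystack = $item[$token['target']] ?? '';
        if ($token['type'] === 'regex') {
            // '~' as delimiter is an assumption; a value containing '~' would need escaping.
            $hit = preg_match('~' . $token['value'] . '~', $haystack) === 1;
        } else {
            // Case-insensitive substring match for plain strings.
            $hit = stripos($haystack, $token['value']) !== false;
        }
        // A "not" token must not match, any other token must match.
        if ($hit === $token['not']) {
            return false;
        }
    }
    return true;
}

$tokens = array(
    array('type' => 'string', 'target' => 'author', 'not' => false, 'value' => 'vinci'),
    array('type' => 'string', 'target' => 'content', 'not' => true, 'value' => 'free'),
);
$items = array(
    array('author' => 'vinci', 'content' => 'a tokenizer post'),
    array('author' => 'vinci', 'content' => 'free stuff'),
    array('author' => 'someone', 'content' => 'a tokenizer post'),
);

$kept = array();
foreach ($items as $item) {
    if (matchesTokens($item, $tokens)) {
        $kept[] = $item;
    }
}
// $kept now holds only the first item.
```

For an SQL backend the same loop would build WHERE conditions instead of testing each row in PHP.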
OK, let’s start:
public function tokenize($filter)
{
    //Main output array//
    $output = array();
    //Incomplete flags//
    $incompleteregex = false;
    $incompletestring = false;
    //Output array item as described in func description//
    $tempitem = array();
}
These are only initializations: the main output array that will contain one array per filter item, the flags that tell us whether we have started a regex or string that still has to be completed, and the single array item, which is an array itself.
(Note: from now on, code is added at the bottom of our function, but still inside it.)
We need a loop to iterate through all the chars:
//Main loop//
for($i = 0; $i < strlen($filter); $i++)
{
And we want to know the current char we’re processing:
//Get the actual char//
$actualchar = substr($filter, $i, 1);
OK, the first thing we want to do is check whether we have a minus sign “-”, but ONLY if our array is still empty. If the array already contains a value, a minus sign can no longer act as a negation; if one occurs, it is part of our regex or string:
//Check for minus only if $tempitem still empty//
if(empty($tempitem))
{
    //If empty $tempitem and whitespace => ignore//
    //(checked before the defaults are set, so $tempitem stays empty//
    //and the minus check still runs after repeated whitespace)//
    if(preg_match('/\s/',$actualchar))
        continue;
    //Set target to content by default//
    $tempitem['target'] = "content";
    $tempitem['type'] = "string";
    //Initialize value//
    $tempitem['value'] = "";
    if(strcmp($actualchar, '-') === 0)
    {
        $tempitem['not'] = true;
        continue;
    }
    else
        $tempitem['not'] = false;
}
So what we are doing here is:
- If our array is empty, we initialize it with some default values, so this code will not be executed again for the current item.
- If the char is a whitespace, we ignore it. (We could also call trim on the passed value, but that would only work for the start and the end, not for the values in between; those are also separated by whitespace, and we want to ignore multiple whitespace chars in a row.)
- Finally, we compare the current char to “-”. If it matches, we set our “not” value to true and continue with the next char; otherwise we set “not” to false and do not continue, but go on through the loop body to see how the char needs to be processed.
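The steps in the list above can also be tried in isolation. The following is a minimal sketch, assuming a hypothetical helper splitNegation that handles a single term; it is not part of the tokenizer itself and only shows the whitespace skipping and the minus check.

```php
<?php
// Hypothetical helper, not part of the tokenizer:
// skips leading whitespace, then reports whether the term is negated.
function splitNegation(string $term): array
{
    $i = 0;
    $length = strlen($term);
    // Ignore any run of leading whitespace.
    while ($i < $length && preg_match('/\s/', $term[$i])) {
        $i++;
    }
    // A minus sign only counts directly at the start of the term.
    $not = ($i < $length && $term[$i] === '-');
    if ($not) {
        $i++;
    }
    return array('not' => $not, 'rest' => substr($term, $i));
}

$a = splitNegation('   -free');   // not => true,  rest => "free"
$b = splitNegation('semi-free');  // not => false, rest => "semi-free"
```

The second call shows why the minus sign is only checked while nothing has been collected yet: a minus inside a word must stay part of the value.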
OK, so at this point we’ve processed all unneeded whitespace and the minus sign. The next thing that may occur is a scope for the filter, say “author” or “title”. What characterizes these values is that they are terminated by a “:”, so we can check for the “:” sign and compare it against what we’ve stored as the string so far:
//Check for : //
if(strcmp($actualchar, ':') === 0)
{
    //Now see if we have a valid value stored in value//
    if(strcmp($tempitem['value'], "author") === 0)
    {
        $tempitem['target'] = "author";
        $tempitem['value'] = "";
        continue;
    }
    else if(strcmp($tempitem['value'], "url") === 0)
    {
        $tempitem['target'] = "url";
        $tempitem['value'] = "";
        continue;
    }
    else if(strcmp($tempitem['value'], "title") === 0)
    {
        $tempitem['target'] = "title";
        $tempitem['value'] = "";
        continue;
    }
    //Needed for the example input content:/tokenizer/ //
    else if(strcmp($tempitem['value'], "content") === 0)
    {
        $tempitem['target'] = "content";
        $tempitem['value'] = "";
        continue;
    }
}
What we did is this: if we read a “:” char, we compare the string collected so far in $tempitem['value'] (the storing is done at the end of the function, we’ll come to this later) with the scopes handled above (such as title, url, or author). If we have an exact match, we set our target to the corresponding value and clear the value collected so far, because the scope is stored in “target” and we want only keywords in “value”.
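If the list of scopes grows, the if/else chain gets long. As an alternative, here is a small sketch that uses a lookup list instead. The helper name splitScope, the $scopes list, and the default target are assumptions for this example; it works on one complete term rather than char by char.

```php
<?php
// Hypothetical list of allowed scopes; adapt it to your own data.
$scopes = array('author', 'url', 'title', 'content');

// Returns the target and the remaining value for one term such as "author:vinci".
function splitScope(string $term, array $scopes, string $default = 'content'): array
{
    $pos = strpos($term, ':');
    if ($pos !== false) {
        $candidate = substr($term, 0, $pos);
        // Only an exact match counts as a scope.
        if (in_array($candidate, $scopes, true)) {
            return array('target' => $candidate, 'value' => substr($term, $pos + 1));
        }
    }
    // No known scope: keep the whole term as the value.
    return array('target' => $default, 'value' => $term);
}

$a = splitScope('author:vinci', $scopes);        // target "author", value "vinci"
$b = splitScope('http://example.com', $scopes);  // target "content", value unchanged
```

The second call shows the same rule as in the loop: a “:” that does not follow a known scope is just part of the value.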
The next step, if our value is empty, is to check the current char for the opening character of a string or a regex:
//If empty, check for start of string or regex//
if(empty($tempitem['value']) && !$incompletestring && !$incompleteregex)
{
    if(strcmp($actualchar, "/") === 0)
    {
        $tempitem['type'] = "regex";
        $incompleteregex = true;
        continue;
    }
    else if(strcmp($actualchar, '"') === 0)
    {
        $tempitem['type'] = "string";
        $incompletestring = true;
        continue;
    }
}
We set the corresponding flag, which tells us that all subsequent chars are to be interpreted as part of a string or regex, so whitespace does not trigger a new array item. We do not enter this code if our value is still empty but one of the incomplete flags is already set, because that would mess everything up (the flags would still be set).
The next part of the code runs if an incomplete flag is set, the current char is the matching closing char for the regex or string (depending on the flag), and our value is not empty. If we have a regex, we test whether it is valid and throw an exception if it is not. Finally, we need to check whether we have processed the last char of the string, because in that case we have to push the last item onto the output array right away:
//regex or string to finish?//
if(!empty($tempitem['value']) && $incompleteregex && strcmp($actualchar, "/") === 0)
{
    $incompleteregex = false;
    //Check for good regex//
    //(ASCII delimiter: a multi-byte char like § does not work as delimiter in UTF-8 source)//
    if(preg_match("~".$tempitem['value']."~", "http://w3.ibm.com") === false)
        throw new InvalidArgumentException("The regex ".$tempitem['value']." is not a valid regex!");
    //Last char?//
    if($i == (strlen($filter) -1))
    {
        //remove whitespace//
        $tempitem['value'] = trim($tempitem['value']);
        //Push output//
        array_push($output,$tempitem);
    }
    continue;
}
else if(!empty($tempitem['value']) && $incompletestring && strcmp($actualchar, '"') === 0)
{
    $incompletestring = false;
    //Last char?//
    if($i == (strlen($filter) -1))
    {
        //remove whitespace//
        $tempitem['value'] = trim($tempitem['value']);
        //Push output//
        array_push($output,$tempitem);
    }
    continue;
}
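The regex validity check from the block above can also be pulled out into its own function. This is a minimal sketch, assuming a hypothetical helper isValidRegex and “~” as the delimiter; it relies on preg_match() returning false when the pattern does not compile.

```php
<?php
// Hypothetical helper: true if $pattern (given without delimiters) compiles.
// The '~' delimiter is an assumption; a pattern containing an
// unescaped '~' would be reported as invalid.
function isValidRegex(string $pattern): bool
{
    // preg_match() returns false on error; @ hides the compile warning.
    return @preg_match('~' . $pattern . '~', '') !== false;
}

$ok  = isValidRegex('token(izer)?');  // true
$bad = isValidRegex('token(izer');    // false, unbalanced parenthesis
```

Note the strict comparison with false: a valid pattern that simply does not match returns 0, which must not be treated as an error.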
And finally, this takes us to the part where, if no special case has applied so far, we add our char to the “value”:
//If no check applied, add our char//
$tempitem['value'] .= $actualchar;
And the last part: if no incomplete flag is set and we read a whitespace (or have reached the last char), we push the current temp array onto our output array and reset the temp array:
//No regex & string to finish and whitespace//
if(!$incompleteregex && !$incompletestring && (preg_match('/\s/',$actualchar) || ($i == (strlen($filter) -1))))
{
    //remove whitespace//
    $tempitem['value'] = trim($tempitem['value']);
    //Push output//
    array_push($output,$tempitem);
    //Reset values//
    $tempitem = array();
    $incompleteregex = false;
    $incompletestring = false;
}
} //End of main loop//
return $output;
} //End of function//
Don’t forget to reset the flags and to return the output array.
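To cross-check the expected output for the example string from the beginning, here is a compact alternative sketch. Note that it is not the char-by-char loop built in this post: it uses one regular expression per term instead. The function name tokenizeWithRegex and the pattern are assumptions for this example, and the default target “content” follows the default set in the code above.

```php
<?php
// Alternative sketch (an assumption, not the loop from this post):
// one regular expression per term instead of a char-by-char scan.
function tokenizeWithRegex(string $filter): array
{
    // Groups: 1 optional minus, 2 optional scope, 3 /regex/, 4 "quoted string", 5 bare word
    $pattern = '~(-?)(?:(author|url|title|content):)?(?:/([^/]+)/|"([^"]+)"|(\S+))~';
    preg_match_all($pattern, $filter, $matches, PREG_SET_ORDER);

    $output = array();
    foreach ($matches as $m) {
        // Trailing groups that did not take part are missing, hence the ?? fallbacks.
        $isRegex = ($m[3] ?? '') !== '';
        if ($isRegex) {
            $value = $m[3];
        } elseif (($m[4] ?? '') !== '') {
            $value = $m[4];
        } else {
            $value = $m[5];
        }
        $output[] = array(
            'type'   => $isRegex ? 'regex' : 'string',
            'target' => $m[2] !== '' ? $m[2] : 'content',
            'not'    => $m[1] === '-',
            'value'  => $value,
        );
    }
    return $output;
}

$tokens = tokenizeWithRegex('author:vinci -free content:/tokenizer/ -"not this string"');
// $tokens contains four items: vinci, free, tokenizer, not this string.
```

Unlike the loop version, this sketch does not validate the regex value and silently treats an unterminated quote as a plain word, so it is only meant for comparing results.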
That’s it, I hope it helped. The complete code will be uploaded to the codeland lib in the next few days, in “utilityfunctions.php”.
Have a nice day.