php - How to reduce a string containing the same substring repeated n times to a single instance of the substring


I have strings like 'ageage' or 'feetfeetfeet' or 'cmcmcmcmcm' and would like to reduce these to 'age', 'feet', and 'cm' respectively.

This is an intermediate step in normalizing certain classes of data fields so they can be matched across different data sources. The fields originally also contained numbers; the numeric parts have been moved into a separate string. All Unicode letters have been transliterated to lowercase ASCII letters with:

public static function transliterate(string $value)
{
    $transliterator = Transliterator::createFromRules(
        ':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;',
        Transliterator::FORWARD
    );
    return $transliterator->transliterate($value);
}

Also note that pluralization doesn't matter: although the examples I've provided are in English, the project mainly normalizes Turkish strings, where such words would always be singular.

I expect this can be done with a regex, though I'm not entirely sure how.

Answer

Solution:

I assume a non-regex solution is OK.

This method tries each prefix up to half the length of the string and looks for one that, when removed with str_replace, leaves an empty string.
If we find one, then we know the string is that prefix repeated.

$str = 'feetfeetfeet';
$return = $str; // return full str if it fails

$len = strlen($str);

for($i = 1; $i <= $len/2; $i++){ // <= so a unit repeated exactly twice ('ageage') is also found
    $sub = substr($str, 0, $i);
    if(str_replace($sub, "", $str) == ""){
        $return = $sub;
        break;
    }
}

echo $return; //feet
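
The same idea can be wrapped in a function so it is easy to try on several inputs. This is only a minimal sketch: the name reduceRepeated is made up for illustration, and the divisibility check is an added shortcut rather than part of the answer above. Note the inclusive bound (<=): a unit repeated exactly twice, like 'ageage', is exactly half the string.

```php
<?php

// Hypothetical helper: returns the shortest prefix that, repeated, makes up $str.
// Falls back to the full string when no such prefix exists.
function reduceRepeated(string $str): string
{
    $len = strlen($str);

    // A repeating unit can be at most half the string long.
    for ($i = 1; $i <= intdiv($len, 2); $i++) {
        // Skip lengths that cannot tile the string evenly.
        if ($len % $i !== 0) {
            continue;
        }
        $sub = substr($str, 0, $i);
        // If removing every occurrence leaves nothing, $str is $sub repeated.
        if (str_replace($sub, '', $str) === '') {
            return $sub;
        }
    }

    return $str;
}

foreach (['ageage', 'feetfeetfeet', 'cmcmcmcmcm', 'xyz'] as $input) {
    echo $input, ' => ', reduceRepeated($input), PHP_EOL;
}
```

Because the loop starts from the shortest prefix, 'mmmm' reduces to 'm', not 'mm'.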

Answer

Solution:

This is similar to finding the longest proper prefix of the string that is also a suffix. The answer is then the first (length - length of that prefix) characters of the string. The prefix-suffix table is built the same way as the failure function in the Knuth-Morris-Pratt (KMP) string-matching algorithm.

Time complexity is O(n) and space complexity is O(n).

Snippet:

<?php

$str = "feetfeetfeet";
$length = strlen($str);

$prefix_suffix_table = array_fill(0, $length, 0);

$j = 0;
for($i = 1; $i < $length; ++$i){
    while($j > 0 && $str[$i] != $str[$j]){
        $j = $prefix_suffix_table[$j - 1];
    }

    if($str[$i] == $str[$j]){
        $prefix_suffix_table[$i] = ++$j;
    }
}

echo substr($str, 0, $length - end($prefix_suffix_table));

Demo: http://sandbox.onlinephpfunctions.com/code/b401c75cde38a51a561b53bb0a6294eb615b208c

Note: If your string is malformed, like xyz, which doesn't have a repeating substring, you can add a check using str_repeat() and throw an exception if required.
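
A rough sketch of what that extra check could look like is below. The function name smallestRepeatingUnit and the choice of exception classes are assumptions made for illustration, as is the decision to treat a string with fewer than two repetitions as malformed.

```php
<?php

// Hypothetical helper: KMP-style prefix table plus a str_repeat() sanity check.
function smallestRepeatingUnit(string $str): string
{
    $length = strlen($str);
    if ($length === 0) {
        throw new InvalidArgumentException('Empty input');
    }

    // $table[$i] = length of the longest proper prefix of $str[0..$i]
    // that is also a suffix of it.
    $table = array_fill(0, $length, 0);
    $j = 0;
    for ($i = 1; $i < $length; ++$i) {
        while ($j > 0 && $str[$i] !== $str[$j]) {
            $j = $table[$j - 1];
        }
        if ($str[$i] === $str[$j]) {
            $table[$i] = ++$j;
        }
    }

    $unitLength = $length - $table[$length - 1];
    $unit = substr($str, 0, $unitLength);
    $reps = intdiv($length, $unitLength);

    // Assumption: fewer than two repetitions counts as malformed input.
    if ($reps < 2 || str_repeat($unit, $reps) !== $str) {
        throw new UnexpectedValueException("No repeating substring in '$str'");
    }

    return $unit;
}

echo smallestRepeatingUnit('feetfeetfeet'), PHP_EOL;
```

With this check, inputs such as 'xyz' or 'ababa' raise an exception instead of silently returning a prefix.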

Answer

Solution:

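You can also use str_split() to convert the string into an array of characters, keep only the unique characters with array_unique(), and implode() them back into a string. Note that this only works when no letter occurs more than once within the repeating unit: 'feetfeetfeet' would become 'fet'.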

<?php
$str = array_unique(str_split('ageage'));
$result = implode($str);
?>

Output

age
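
To see where this approach holds and where it breaks, here is a small comparison. The wrapper name uniqueLetters is hypothetical and exists only for this illustration.

```php
<?php

// Hypothetical wrapper around the array_unique() approach, for comparison only.
function uniqueLetters(string $str): string
{
    return implode('', array_unique(str_split($str)));
}

echo uniqueLetters('ageage'), PHP_EOL;       // age
echo uniqueLetters('cmcmcmcmcm'), PHP_EOL;   // cm
echo uniqueLetters('feetfeetfeet'), PHP_EOL; // fet (the second "e" is dropped)
```

The last line shows that any unit containing a repeated letter loses characters, so this is not a general solution.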

Answer

Solution:

I have figured out how to do this with a regex, although I have realized that it might not be useful for my purposes, because mmmm can be either 2x mm (millimeters) or 4x m (meters). If I only care about supporting up to 3 repetitions, I can use:

if(preg_match('/^([a-z]*)\1{2}$/', $input, $matches)) {
    $repeating = $matches[1];
    $reps = 3;
} elseif(preg_match('/^([a-z]*)\1$/', $input, $matches)) {
    $repeating = $matches[1];
    $reps = 2;
} else {
    $repeating = $input;
    $reps = 1;
}
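
For reuse, the three cases could be wrapped in a function along these lines. This is a sketch only: the name splitRepeats and the [unit, repetitions] return shape are invented here, and the group uses + rather than * so that an empty input falls through to the single-repetition case.

```php
<?php

// Hypothetical helper: returns [unit, repetitions], supporting up to 3 repetitions.
function splitRepeats(string $input): array
{
    // Three repetitions: the captured group followed by itself twice.
    if (preg_match('/^([a-z]+)\1{2}$/', $input, $matches)) {
        return [$matches[1], 3];
    }
    // Two repetitions: the captured group followed by itself once.
    if (preg_match('/^([a-z]+)\1$/', $input, $matches)) {
        return [$matches[1], 2];
    }
    // Anything else is treated as a single, non-repeated unit.
    return [$input, 1];
}

[$unit, $reps] = splitRepeats('feetfeetfeet');
echo $unit, ' x ', $reps, PHP_EOL;
```

Inputs with more than three repetitions that fit neither pattern, such as 'cmcmcmcmcm', come back unchanged with a count of 1.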

Note that the following will divide the string into the smallest prime number of repeats:

preg_match('/^([a-z]*)\1+$/', $input, $matches);
$repeating = $matches[1];

Here is a table of the outputs of this:

╔════════════╦════════════╗
║   $input   ║ $repeating ║
╠════════════╬════════════╣
║ mm         ║ m          ║
║ mmm        ║ m          ║
║ mmmm       ║ mm         ║
║ mmmmm      ║ m          ║
║ mmmmmm     ║ mmm        ║
║ mmmmmmm    ║ m          ║
║ mmmmmmmm   ║ mmmm       ║
║ mmmmmmmmm  ║ mmm        ║
║ mmmmmmmmmm ║ mmmmm      ║
╚════════════╩════════════╝

Because only the smallest prime subdivisions are considered,

preg_match('/^([a-z]*)\1{1,2}$/', $input, $matches)

is unsuitable, as it will, like in the above table, find the repeating part of 'mmmmmm' to be 'mmm' instead of the desired 'mm'.

The three-case implementation I provided at the beginning is what I am currently using, because my input is generally either age groups or product dimensions. I have yet to see a product described with more than three dimensions, or with an age group like '11yr,12yr,13yr,14yr', though I can imagine something like the latter, however rare, eventually occurring. Thus I will probably move away from this method and switch to extracting the units from the original string containing the numbers with preg_match_all:

preg_match_all('/([0-9]+)\s*([a-z]*)\s*/', $input, $matches)
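
As a rough illustration of what that call yields, the sketch below runs it on the age-group example mentioned above. The helper name extractUnits and the shape of the returned array are assumptions for this example, not part of my actual code.

```php
<?php

// Hypothetical helper: collects the numbers and the distinct units that follow them.
function extractUnits(string $input): array
{
    // $matches[1] holds every number, $matches[2] the letters right after each one.
    preg_match_all('/([0-9]+)\s*([a-z]*)\s*/', $input, $matches);

    return [
        'numbers' => $matches[1],
        'units'   => array_values(array_unique($matches[2])),
    ];
}

$result = extractUnits('11yr,12yr,13yr,14yr');
echo implode(',', $result['numbers']), ' / ', implode(',', $result['units']), PHP_EOL;
```

Since the unit is read next to each number, there is no repeated string to collapse afterwards, and the 'mmmm' ambiguity does not arise.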

However, in case someone else is actually interested in finding the smallest repeating substring (so 'm' for 'mmmm'), this can be done with a regex in a loop:

$repeating = $input;
while(preg_match('/^([a-z]+)\1+$/', $repeating, $matches)) { // + (not *) so an empty $input cannot loop forever
    $repeating = $matches[1];
}

This will produce:

╔════════════╦════════════╗
║   $input   ║ $repeating ║
╠════════════╬════════════╣
║ mm         ║ m          ║
║ mmm        ║ m          ║
║ mmmm       ║ m          ║
║ mmmmm      ║ m          ║
║ mmmmmm     ║ m          ║
║ mmmmmmm    ║ m          ║
║ mmmmmmmm   ║ m          ║
║ mmmmmmmmm  ║ m          ║
║ mmmmmmmmmm ║ m          ║
║ cmcm       ║ cm         ║
║ cmcmcm     ║ cm         ║
║ cmcmcmcm   ║ cm         ║
║ cmcmcmcmcm ║ cm         ║
╚════════════╩════════════╝
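
The loop can be packaged as a function to check rows of the table programmatically. This is a minimal sketch with a made-up name, smallestUnitByRegex. It uses + in the capture group so that the group can never be empty, which means an empty input simply returns as is.

```php
<?php

// Hypothetical helper: repeatedly strips repeats until no further split is possible.
function smallestUnitByRegex(string $input): string
{
    $repeating = $input;
    // Each pass replaces the string with its largest repeating unit;
    // the loop ends once the remainder is no longer a repetition.
    while (preg_match('/^([a-z]+)\1+$/', $repeating, $matches)) {
        $repeating = $matches[1];
    }
    return $repeating;
}

foreach (['mmmmmmmm', 'cmcmcmcm', 'feetfeetfeet'] as $input) {
    echo $input, ' => ', smallestUnitByRegex($input), PHP_EOL;
}
```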
