The other day I was writing a PHP script to scrape reviews from a few websites and collect ratings data. The reviews contained text like “5.0 of 5 bubbles” that displayed the rating value and the maximum possible rating. My goal was to extract just the rating number and max rating as usable values for each review.
Initially I tried using PHP string functions like substr() and strpos() to cut out pieces of the text. But I quickly realized this would require some ugly string manipulation code that would break easily if the text format changed at all. There had to be a better way!
That’s when I remembered regular expressions would be perfect for parsing this kind of text. Regex allows you to search for matches to flexible text patterns instead of exact strings. After some tinkering, I came up with a regex that would nicely capture the rating number and max rating from the strings:
/(\d\.?\d+)\s+of\s+(\d+)\s+bubbles/
Here’s how it works:
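- (\d\.?\d+) – Captures the rating number: one digit, an optional decimal point, then one or more digits. This matches “5.0” or “4.5”, but not a single-digit integer such as “5”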
- \s+ – Matches 1+ whitespace characters
- of – Matches the word “of”
- \s+ – Matches 1+ whitespace characters
- (\d+) – Captures the max rating as one or more digits
- bubbles – Matches the word “bubbles”
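To see what the rating group accepts in practice, here is a small probe script. The $permissive pattern is my own hypothetical variant, not part of the original script; it assumes whole-number ratings such as “4 of 5 bubbles” can also show up on the page.

```php
<?php
// Pattern from the post: digit, optional dot, then one or more digits.
$original = '/(\d\.?\d+)\s+of\s+(\d+)\s+bubbles/';

// Hypothetical variant: one or more digits with an optional fractional part,
// so whole-number ratings such as "4" are accepted as well.
$permissive = '/(\d+(?:\.\d+)?)\s+of\s+(\d+)\s+bubbles/';

$samples = ['5.0 of 5 bubbles', '4.5 of 5 bubbles', '4 of 5 bubbles'];

foreach ($samples as $sample) {
    // preg_match returns 1 when the pattern matches and 0 when it does not.
    $a = preg_match($original, $sample);
    $b = preg_match($permissive, $sample);
    printf("%-18s original=%d permissive=%d\n", $sample, $a, $b);
}
```

The first two samples match both patterns; the third matches only the permissive one.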
With this regex, I could use PHP’s preg_match function to extract exactly the pieces of data I needed from any review string containing this format. No messy string manipulation required!
The code looked like:
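$str = '5.0 of 5 bubbles';
// preg_match returns 1 on a match, 0 on no match, and false on error
if (preg_match('/(\d\.?\d+)\s+of\s+(\d+)\s+bubbles/', $str, $matches) === 1) {
    $rating = (float) $matches[1]; // first capture group: the rating
    $max = (int) $matches[2];      // second capture group: the max rating
}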
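Since the script had to do this for every review, the same logic could be wrapped in a function. This is only a sketch: parseRating and the sample review strings are hypothetical names and data I made up for illustration, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns ['rating' => float, 'max' => int],
// or null when the text contains no "X of Y bubbles" phrase.
function parseRating(string $text): ?array
{
    $pattern = '/(\d\.?\d+)\s+of\s+(\d+)\s+bubbles/';

    // Anything other than 1 means no match (0) or a regex error (false).
    if (preg_match($pattern, $text, $matches) !== 1) {
        return null;
    }

    // Captures come back as strings, so cast them to numeric types.
    return ['rating' => (float) $matches[1], 'max' => (int) $matches[2]];
}

// Made-up review snippets for illustration only.
$reviews = [
    'Great stay! 5.0 of 5 bubbles',
    'No rating here',
    '3.5 of 5 bubbles, would return',
];

foreach ($reviews as $review) {
    var_dump(parseRating($review));
}
```

Returning null for text without a rating lets the calling code skip those reviews instead of reading from an empty $matches array.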
And voilà! $rating contained the rating, $max contained the max rating, extracted cleanly from the string in just a couple of lines. Regex for the win!
Using regular expressions gave me an easy way to reliably parse out and capture the data I needed from these strings. Because \s+ accepts any amount of whitespace, the same regex keeps working if the spacing changes, and a change in wording would only mean adjusting the pattern rather than rewriting all my parsing logic. Now that’s coding efficiency and simplicity at its finest!
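If the wording itself did vary between sites, the pattern could be loosened a little. The sketch below is an assumption on my part rather than something the original script needed: it adds the i flag for case-insensitive matching, accepts a hypothetical “stars” label alongside “bubbles”, and uses named groups so the captures can be read by name.

```php
<?php
// Hypothetical widened pattern:
//   (?<rating>...) and (?<max>...) are named capture groups,
//   (?:bubbles|stars) accepts either label, and /i ignores case.
$pattern = '/(?<rating>\d\.?\d+)\s+of\s+(?<max>\d+)\s+(?:bubbles|stars)/i';

// Made-up variations of the review text for illustration only.
$variants = ['4.5 of 5 bubbles', '4.5  OF  5 Bubbles', '3.0 of 5 stars'];

foreach ($variants as $text) {
    if (preg_match($pattern, $text, $m) === 1) {
        // Named groups are available under their names in $m.
        printf("%s => %s / %s\n", $text, $m['rating'], $m['max']);
    }
}
```

All three sample strings match this pattern, while the original one would only match the first.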