Regular Expressions for Extracting Rating Data

Recently, I was working on a review scraper and needed to pull just the rating values out of strings like “5.0 of 5 bubbles”. Regular expressions turned out to be a good fit for parsing and capturing the data I needed.

Kode

November 17, 2023


The other day I was writing a PHP script to scrape reviews from a few websites and collect ratings data. The reviews contained text like “5.0 of 5 bubbles”, which shows the rating value and the maximum possible rating. My goal was to extract just the rating number and the max rating as usable values for each review.

Initially I tried using PHP string functions like substr() and strpos() to cut out pieces of the text. But I quickly realized this would require some ugly string manipulation code that would break easily if the text format changed at all. There had to be a better way!
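To give an idea of what that looks like, here is a rough sketch of the strpos()/substr() route. It is a reconstruction for illustration, not the code I actually wrote; the function name extractRatingWithStringFunctions and the choice of " of " and " bubbles" as markers are assumptions.

```php
<?php
// Hypothetical sketch of the string-function approach (not the original code).
function extractRatingWithStringFunctions(string $text): ?array
{
    $ofPos = strpos($text, ' of ');
    $bubblesPos = strpos($text, ' bubbles');
    if ($ofPos === false || $bubblesPos === false) {
        return null; // one of the expected markers is missing
    }

    // Everything before " of " is assumed to be the rating.
    $rating = substr($text, 0, $ofPos);

    // Everything between " of " and " bubbles" is assumed to be the max rating.
    $maxStart = $ofPos + strlen(' of ');
    $max = substr($text, $maxStart, $bubblesPos - $maxStart);

    return [$rating, $max];
}

// Works on the exact format: returns ['5.0', '5']
$clean = extractRatingWithStringFunctions('5.0 of 5 bubbles');

// Breaks as soon as anything precedes the rating: returns ['Rated 5.0', '5']
$messy = extractRatingWithStringFunctions('Rated 5.0 of 5 bubbles');
```

The second call shows the fragility: the offsets only line up when the string starts with the rating and uses exactly one space around each word.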

That’s when I remembered regular expressions would be perfect for parsing this kind of text. Regex allows you to search for matches to flexible text patterns instead of exact strings. After some tinkering, I came up with a regex that would nicely capture the rating number and max rating from the strings:

/(\d\.?\d+)\s+of\s+(\d+)\s+bubbles/

Here’s how it works:

(\d\.?\d+) – Captures the rating number: a digit, an optional decimal point, then one or more digits. That covers values like “5.0”, “4.5”, or “10”, but not a single-digit integer like “5”
\s+ – Matches 1+ whitespace characters
of – Matches the word “of”
\s+ – Matches 1+ whitespace characters
(\d+) – Captures the max rating, a whole number
bubbles – Matches the word “bubbles”
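The same breakdown can be written straight into the pattern using PCRE's x (extended) flag, which ignores whitespace in the pattern and allows # comments. This is a variant sketch rather than the regex I used: it swaps in \d+(?:\.\d+)? for the rating so that a bare integer such as “5” also matches, and the group names rating and max are labels I picked for illustration.

```php
<?php
// Variant of the pattern with the x flag and named groups (names are illustrative).
$pattern = '/
    (?<rating> \d+ (?: \. \d+ )? )  # rating: digits, optionally followed by a dot and more digits
    \s+ of \s+                      # the word "of" with whitespace on both sides
    (?<max> \d+ )                   # max rating: a whole number
    \s+ bubbles                     # the word "bubbles"
/x';

if (preg_match($pattern, '4.5 of 5 bubbles', $m) === 1) {
    // Named groups are available by name as well as by index.
    echo $m['rating'], ' out of ', $m['max'], PHP_EOL; // prints "4.5 out of 5"
}
```

Named groups make the later code a little easier to read, since $m['rating'] says more than $m[1].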

With this regex, I could use PHP’s preg_match function to extract exactly the pieces of data I needed from any review string containing this format. No messy string manipulation required!

The code looked like:

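$str = '5.0 of 5 bubbles';

// preg_match() returns 1 on a match and fills $matches with the captured groups
if (preg_match('/(\d\.?\d+)\s+of\s+(\d+)\s+bubbles/', $str, $matches) === 1) {
    $rating = $matches[1]; // "5.0" (captured as a string)
    $max = $matches[2];    // "5"
}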

And voilà! $rating contained the rating number and $max contained the max rating, extracted cleanly from the string in just a couple of lines. Regex for the win!
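One thing to keep in mind is that preg_match hands the captured values back as strings, and it returns 0 when nothing matches. For reuse across many reviews, the extraction could be wrapped in a small helper along these lines. The function name parseBubbleRating and the array shape it returns are my own assumptions for this sketch, and it uses \d+(?:\.\d+)? for the rating so that a bare integer is accepted too.

```php
<?php
// Hypothetical helper: returns numeric values, or null when the text has no rating.
function parseBubbleRating(string $text): ?array
{
    $pattern = '/(\d+(?:\.\d+)?)\s+of\s+(\d+)\s+bubbles/';

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    if (preg_match($pattern, $text, $matches) !== 1) {
        return null;
    }

    // Captures are strings, so cast them before doing any arithmetic.
    return ['rating' => (float) $matches[1], 'max' => (int) $matches[2]];
}

$parsed = parseBubbleRating('4.5 of 5 bubbles');
if ($parsed !== null) {
    printf("%.1f / %d\n", $parsed['rating'], $parsed['max']); // prints "4.5 / 5"
}
```

Returning null for a non-match keeps the calling code honest: it has to decide what to do with a review that carries no rating instead of reading an undefined $matches index.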

Using regular expressions gave me an easy way to reliably parse out and capture the data I needed from these strings. Because the pattern describes the format rather than fixed character positions, it keeps working when the numbers or the amount of whitespace change, and adapting it to a different label means editing one pattern instead of rewriting all my parsing logic. Now that’s coding efficiency and simplicity at its finest!
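As a quick illustration of that tolerance, here is a sketch that runs the pattern over a block of text with preg_match_all. The sample text is made up, and the pattern is the variant with \d+(?:\.\d+)? for the rating so that a bare “1” is picked up as well.

```php
<?php
// Made-up sample text with uneven spacing and surrounding words.
$page = "Great stay! 5.0 of 5 bubbles\nJust fine: 3.5  of   5 bubbles\nAvoid. 1 of 5 bubbles";

$pattern = '/(\d+(?:\.\d+)?)\s+of\s+(\d+)\s+bubbles/';

// PREG_SET_ORDER groups the results per match: [full match, rating, max].
$count = preg_match_all($pattern, $page, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    echo $match[1], ' / ', $match[2], PHP_EOL;
}
// prints:
// 5.0 / 5
// 3.5 / 5
// 1 / 5
```

The pattern still depends on the words “of” and “bubbles”, so a string like “5.0 out of 5 stars” would need the pattern adjusted.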