PHP Hidden Gem: similar_text()
Every once in a while you need to solve a unique problem, the kind you usually only have to solve once or twice. So you start developing, but quickly wonder what the best solution is. You can write your own, but it pays off to at least search through PHP.net a bit to see whether PHP already has a solution for it.
Today, I found myself in such a situation. I am working on a hobby project where I aggregate feeds from several different sources. With the blogs I work with right now, it often happens that an author posts the same post to a few different sites. However, because of site formats, and sometimes also quick edits an author makes on one site but not on the other, the article contents are usually not identical strings. So I needed something that would help me figure out whether or not two strings are nearly identical.
First I Googled around a bit, trying to figure out a sane approach. I was thinking of simply piping the strings into diff and parsing the response, but that seemed highly inefficient. I also came across xdiff, a PECL extension which allows for some very cool functionality, but in the end I didn't really want to diff, I just wanted to figure out whether strings were similar.
The fact that Google brought me to PHP.net, though, made me consider that PHP might actually have a solution to my problem already. Then I remembered using soundex() before, which is sort of what I need, except that I need it for more than just a single word. I decided to check the soundex() page anyway, to see if there were similar functions listed that would perhaps help me out. And indeed there were: levenshtein() and similar_text() both seemed to do something approaching what I needed.
I first checked levenshtein(). However, it only accepts strings of up to 255 characters, which is too short when comparing full blog post bodies. So I went with similar_text(), and indeed, that worked fine. My current isDuplicate() method is now:
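public static function isDuplicate($item1, $item2)
{
    // similar_text() writes the similarity percentage (0-100) into $perc.
    similar_text($item1->get('text'), $item2->get('text'), $perc);

    // Always return a boolean; previously non-duplicates fell through and returned null.
    return $perc > 75;
}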
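To make the third argument of similar_text() a bit more concrete, here is a minimal standalone sketch. The function name isNearDuplicate() and its $threshold parameter are hypothetical and only for illustration; they are not part of my actual class, which works on feed item objects rather than plain strings.

```php
<?php
// Hypothetical helper for illustration: works on plain strings and takes
// the cut-off percentage as a parameter instead of hard-coding 75.
function isNearDuplicate(string $a, string $b, float $threshold = 75.0): bool
{
    // similar_text() returns the number of matching characters and writes
    // the similarity percentage (0 to 100) into the third argument.
    similar_text($a, $b, $percent);

    return $percent > $threshold;
}

// The return value is the raw count of matching characters.
$matching = similar_text('World', 'Word', $percent);
printf("%d matching characters, %.2f%% similar\n", $matching, $percent);

var_dump(isNearDuplicate('World', 'Word'));   // well above the 75% cut-off
var_dump(isNearDuplicate('World', 'Banana')); // no characters in common
```

Passing the threshold in as a parameter makes it easier to experiment with different cut-offs without touching the comparison itself.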
I am still trying to figure out which percentage catches the duplicates without also catching too many posts that are merely similar but not actually duplicates. With the 75% threshold above, I seem to catch quite a few duplicates so far.
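One way to approach that tuning, sketched under assumptions: collect a handful of pairs you have judged by hand, then count how many mistakes each candidate threshold makes. The sample strings below are made up for illustration, and countMistakes() is a hypothetical helper, not something from my project.

```php
<?php
// Made-up sample pairs for illustration only:
// [text A, text B, is this really a duplicate?]
$labelledPairs = [
    ['PHP has a hidden gem called similar_text.', 'PHP has a hidden gem called similar_text!', true],
    ['I aggregate feeds from several sources.', 'I aggregate feeds from several different sources.', true],
    ['I aggregate feeds from several sources.', 'Free parking is rare in a big city.', false],
];

// Hypothetical helper: counts wrongly flagged and wrongly missed pairs
// for one threshold.
function countMistakes(array $pairs, float $threshold): array
{
    $falsePositives = 0; // flagged as duplicate, but is not one
    $falseNegatives = 0; // a real duplicate that was not flagged

    foreach ($pairs as [$a, $b, $isDuplicate]) {
        similar_text($a, $b, $percent);
        $flagged = $percent > $threshold;

        if ($flagged && !$isDuplicate) {
            $falsePositives++;
        } elseif (!$flagged && $isDuplicate) {
            $falseNegatives++;
        }
    }

    return ['falsePositives' => $falsePositives, 'falseNegatives' => $falseNegatives];
}

// Try a few candidate thresholds and print the mistakes each one makes.
foreach ([60, 75, 90] as $threshold) {
    $mistakes = countMistakes($labelledPairs, $threshold);
    printf(
        "threshold %d%%: %d false positives, %d false negatives\n",
        $threshold,
        $mistakes['falsePositives'],
        $mistakes['falseNegatives']
    );
}
```

With real posts in place of the sample pairs, the threshold with the fewest mistakes of the kind you care about most is a reasonable starting point.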
So, next time you are in a similar situation, remember that there is an awesome function out there called similar_text(). And perhaps more importantly: next time you need to implement some very specific functionality that you usually don't need, first try PHP.net (and Google) before writing your own solution.
July 10, 2009 - tags: php, hidden gem, strings