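String similarity comparison is useful in many settings, such as spelling correction, text deduplication, and contextual similarity.
The most common way to measure string similarity is to count the minimum number of edit operations (insertions, deletions, or substitutions) needed to turn one string into another. This measure is the edit distance, also known as the Levenshtein distance. The Hamming distance is a special case of edit distance: it counts only substitutions, so it can only be applied to two strings of equal length.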
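To make the two definitions concrete, here is a minimal pure-Python sketch (written for Python 3) of both distances. The function names `levenshtein` and `hamming` are chosen for illustration and are not the python-Levenshtein API; the edit distance uses the standard row-by-row dynamic programming approach.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, or substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        previous = current
    return previous[-1]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


print(levenshtein("kitten", "sitting"))  # 3
print(hamming("karolin", "kathrin"))     # 3
```

This sketch keeps only two rows of the table in memory, so it needs space proportional to the length of `b` rather than to the product of both lengths.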
Other commonly used measures include Jaccard distance, J-W distance (Jaro-Winkler distance), cosine similarity, and Euclidean distance.
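As a rough illustration of three of these measures, the sketch below (Python 3) treats each string as a set or a bag of characters. That representation is an assumption made for the example; in practice the units are often words or n-grams rather than single characters. The function names are illustrative.

```python
import math
from collections import Counter


def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| over the sets of characters."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty strings are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def cosine_similarity(a, b):
    """Cosine of the angle between the character-count vectors."""
    count_a, count_b = Counter(a), Counter(b)
    dot = sum(count_a[ch] * count_b[ch] for ch in count_a.keys() & count_b.keys())
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between the character-count vectors."""
    count_a, count_b = Counter(a), Counter(b)
    chars = count_a.keys() | count_b.keys()
    return math.sqrt(sum((count_a[ch] - count_b[ch]) ** 2 for ch in chars))


print(round(jaccard_distance("night", "nacht"), 4))   # 0.5714
print(round(cosine_similarity("night", "nacht"), 4))  # 0.6
print(euclidean_distance("night", "nacht"))           # 2.0
```

Note that all three ignore character order, so "abc" and "cba" are indistinguishable to them, unlike the edit distance.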
Using python-Levenshtein
Install Levenshtein with the pip install python-Levenshtein command.
# -*- coding: utf-8 -*-
# Python 2 script: the str literals below are UTF-8 byte strings,
# so every metric here is computed byte by byte.
import difflib
# import jieba
import Levenshtein

str1 = "我的骨骼雪白 也长不出青稞"
str2 = "雪的日子 我只想到雪中去si"

# 1. difflib
seq = difflib.SequenceMatcher(None, str1, str2)
ratio = seq.ratio()
print 'difflib similarity1: ', ratio

# difflib, ignoring the characters listed in the junk function
seq = difflib.SequenceMatcher(lambda x: x in ' 我的雪', str1, str2)
ratio = seq.ratio()
print 'difflib similarity2: ', ratio

# 2. Hamming distance: str1 and str2 must have the same length; it counts
# the positions at which two equal-length strings differ
# sim = Levenshtein.hamming(str1, str2)
# print 'hamming similarity: ', sim

# 3. Edit distance: the minimum number of operations (insertion, deletion,
# substitution) needed to turn one string into the other
sim = Levenshtein.distance(str1, str2)
print 'Levenshtein similarity: ', sim

# 4. Levenshtein ratio
sim = Levenshtein.ratio(str1, str2)
print 'Levenshtein.ratio similarity: ', sim

# 5. Jaro distance
sim = Levenshtein.jaro(str1, str2)
print 'Levenshtein.jaro similarity: ', sim

# 6. Jaro-Winkler distance
sim = Levenshtein.jaro_winkler(str1, str2)
print 'Levenshtein.jaro_winkler similarity: ', sim
Output:
difflib similarity1: 0.246575342466
difflib similarity2: 0.0821917808219
Levenshtein similarity: 33
Levenshtein.ratio similarity: 0.27397260274
Levenshtein.jaro similarity: 0.490208958959
Levenshtein.jaro_winkler similarity: 0.490208958959
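The Jaro and Jaro-Winkler values in the output are identical. That is consistent with the usual definition of the Winkler adjustment, which only adds a bonus for a shared prefix, and str1 and str2 start with different characters. The sketch below (Python 3) follows the textbook formulation, with a prefix weight of 0.1 and a prefix capped at 4 characters as assumed defaults. It is not taken from python-Levenshtein, whose implementation may differ in details.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions;
    # each out-of-order pair is counted once, hence the division by 2.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity plus a bonus for a shared prefix (assumed defaults)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1.0 - base)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
# No shared prefix, so the Winkler bonus is zero and both values agree.
print(jaro("CRATE", "TRACE") == jaro_winkler("CRATE", "TRACE"))  # True
```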
ÒÔÉϾÍÊDZ¾ÎĵÄÈ«²¿ÄÚÈÝ£¬Ï£Íû¶Ô´ó¼ÒµÄѧϰÓÐËù°ïÖú£¬Ò²Ï£Íû´ó¼Ò¶à¶àÖ§³Ö½Å±¾Ö®¼Ò¡£
When reposting, please credit the source: 谷谷点程序 » A Detailed Look at Several String Similarity Measures in Python