
A Detailed Look at Several String Similarity Measures in Python

String similarity comparison has many applications, such as spelling correction, text deduplication, and contextual similarity.

The most common way to evaluate string similarity is edit distance, also known as Levenshtein distance: the minimum number of edit operations (insertions, deletions, or substitutions) needed to turn one string into another. Hamming distance is a special case of edit distance that counts only substitutions, so it can only be applied to two strings of equal length.
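To make the two definitions concrete, here is a minimal pure-Python sketch of both distances. The function names `levenshtein` and `hamming` are illustrative only and are not the python-Levenshtein API; the library itself is implemented in C and will be much faster.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute (or match)
        previous = current
    return previous[-1]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


print(levenshtein("kitten", "sitting"))  # 3
print(hamming("karolin", "kathrin"))     # 3
```

The dynamic-programming version keeps only two rows of the table, so memory grows with the length of the second string rather than with the product of both lengths.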

Other commonly used measures include Jaccard distance, J-W distance (Jaro-Winkler distance), cosine similarity, and Euclidean distance.
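The article only names these measures, so the following is a rough sketch under one assumption: each string is represented by its characters (a set for Jaccard, a character-count vector for cosine and Euclidean). Other representations, such as words or n-grams, are equally valid and give different numbers. The function names are hypothetical helpers, not part of any library.

```python
import math
from collections import Counter


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over the sets of characters in a and b."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty strings are treated as identical
    return 1.0 - len(set_a & set_b) / float(len(union))


def cosine_similarity(a, b):
    """Cosine of the angle between the character-count vectors of a and b."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Euclidean distance between the character-count vectors of a and b."""
    va, vb = Counter(a), Counter(b)
    return math.sqrt(sum((va[ch] - vb[ch]) ** 2 for ch in set(va) | set(vb)))


print(jaccard_distance("night", "nacht"))    # 1 - 3/7
print(cosine_similarity("night", "nacht"))   # close to 0.6
print(euclidean_distance("night", "nacht"))  # 2.0
```

Note that character-count vectors ignore order, so anagrams come out as identical under all three measures; edit distance does not have that property.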

Using python-Levenshtein

Install Levenshtein with the command pip install python-Levenshtein.

# -*- coding: utf-8 -*-
# Written for Python 2: print is a statement, and the str literals below are UTF-8 byte strings.
 
import difflib
# import jieba
import Levenshtein
 
str1 = "我的骨骼雪白 也长不出青稞"
str2 = "雪的日子 我只想到雪中去si"
 
# 1. difflib
seq = difflib.SequenceMatcher(None, str1, str2)
ratio = seq.ratio()
print 'difflib similarity1: ', ratio
 
# difflib, treating the listed characters as junk that should not be compared
seq = difflib.SequenceMatcher(lambda x: x in ' 我的雪', str1, str2)
ratio = seq.ratio()
print 'difflib similarity2: ', ratio
 
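# 2. Hamming distance: the number of positions at which two equal-length strings differ.
# str1 and str2 must have the same length (they do not here, so the call stays commented out).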
# sim = Levenshtein.hamming(str1, str2)
# print 'hamming similarity: ', sim
 
# 3. Edit distance: the minimum number of operations (insertion, deletion, substitution) needed to turn one string into the other
sim = Levenshtein.distance(str1, str2)
print 'Levenshtein similarity: ', sim
 
# 4. Levenshtein ratio
sim = Levenshtein.ratio(str1, str2)
print 'Levenshtein.ratio similarity: ', sim
 
# 5. Jaro distance
sim = Levenshtein.jaro(str1, str2)
print 'Levenshtein.jaro similarity: ', sim
 
# 6. Jaro-Winkler distance
sim = Levenshtein.jaro_winkler(str1, str2)
print 'Levenshtein.jaro_winkler similarity: ', sim

Output:

difflib similarity1:  0.246575342466
difflib similarity2:  0.0821917808219
Levenshtein similarity:  33
Levenshtein.ratio similarity:  0.27397260274
Levenshtein.jaro similarity:  0.490208958959
Levenshtein.jaro_winkler similarity:  0.490208958959
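These figures are consistent with a byte-wise comparison rather than a character-wise one: 0.246575342466 equals 18/73, and 73 is the combined UTF-8 byte length of the two sample strings. The sketch below, which assumes Python 3 (where the strings must be encoded explicitly to get bytes), checks that total and recomputes difflib's ratio from its documented definition 2.0*M/T, where M is the number of matched elements and T the total number of elements in both sequences. The variable names are illustrative.

```python
import difflib

str1 = "我的骨骼雪白 也长不出青稞"
str2 = "雪的日子 我只想到雪中去si"

# Python 2 str literals are bytes; encode explicitly to mimic that in Python 3.
b1, b2 = str1.encode("utf-8"), str2.encode("utf-8")

matcher = difflib.SequenceMatcher(None, b1, b2)
# M: total size of all matching blocks (the final sentinel block has size 0).
matched = sum(block.size for block in matcher.get_matching_blocks())
# T: total number of elements (bytes) in both sequences.
total = len(b1) + len(b2)

print(len(str1), len(str2))    # lengths in characters
print(len(b1), len(b2))        # lengths in UTF-8 bytes
print(2.0 * matched / total)   # ratio recomputed by hand
print(matcher.ratio())         # ratio reported by difflib
```

Passing `str1` and `str2` directly under Python 3 compares characters instead of bytes, so the scores will generally differ from the ones listed above.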

ÒÔÉϾÍÊDZ¾ÎĵÄÈ«²¿ÄÚÈÝ£¬Ï£Íû¶Ô´ó¼ÒµÄѧϰÓÐËù°ïÖú£¬Ò²Ï£Íû´ó¼Ò¶à¶àÖ§³Ö½Å±¾Ö®¼Ò¡£
