I have a list of sentences, like:
errList = [ 'Ragu ate lunch but didnt have Water for drinks',
'Rams ate lunch but didnt have Gatorade for drinks',
'Saya ate lunch but didnt have :water for drinks',
'Raghu ate lunch but didnt have water for drinks',
'Hanu ate lunch but didnt have -water for drinks',
'Wayu ate lunch but didnt have water for drinks',
'Viru ate lunch but didnt have .water 4or drinks',
'kk ate lunch & icecream but did have Water for drinks',
'M ate lunch &and icecream but did have Gatorade for drinks',
'Parker ate lunch icecream but didnt have :water for drinks',
'Sassy ate lunch and icecream but didnt have water for drinks',
'John ate lunch and icecream but didnt have -water for drinks',
'Pokey ate lunch and icecream but didnt have Water for drinks',
'Laila ate lunch and icecream but did have water 4or drinks',
]
I want to find the longest phrases/segments (a phrase must be more than 2 words) across the elements of the list, and the number of sentences each occurs in. For the example below, the output would be close to this (the longest phrase as the key and the count as the value):
{ 'ate lunch but didnt have': 7,
'water for drinks': 7,
'ate lunch and icecream': 4,
'didnt have water': 3,
'didnt have Water': 2 # case sensitive
}
Using the re module doesn't seem feasible, since the problem is closer to sequence matching; maybe use nltk or scikit-learn? I have some knowledge of NLP and scikit, but not enough to solve this. If I solve it, I'll post the solution here.
It's not too painful with scikit-learn and a little numpy foo as well. Note, though, that here I'm just using the default preprocessing; if you care about the punctuation in your dataset then you'll need to adjust for that.
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Find all the phrases of >2 words, up to the max sentence length
cv = CountVectorizer(ngram_range=(3, max(len(x.split(' ')) for x in errList)))
# Get the counts of the phrases
err_counts = cv.fit_transform(errList)
# Get the sum of each of the phrases
err_counts = err_counts.sum(axis=0)
# Mess about with the types, sparsity is annoying
err_counts = np.squeeze(np.asarray(err_counts))
# Retrieve the actual phrases that we're working with
feat_names = np.array(cv.get_feature_names_out())  # get_feature_names() in older scikit-learn
# We don't have to sort here, but it's nice to if you want to print anything
err_counts_sorted = err_counts.argsort()[::-1]
feat_names = feat_names[err_counts_sorted]
err_counts = err_counts[err_counts_sorted]
# This is the dictionary that you were after
err_dict = dict(zip(feat_names, err_counts))
Here's the output for the top few:
11 but didnt have
10 have water for drinks
10 have water for
10 water for drinks
10 but didnt have water
10 didnt have water
9 but didnt have water for drinks
9 but didnt have water for
9 didnt have water for drinks
9 didnt have water for
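One caveat: CountVectorizer lowercases and strips punctuation by default, which is why 'Water' and 'water' are merged in the counts above, unlike the case-sensitive output the question asks for. A minimal sketch of keeping case, passing `lowercase=False` and a whitespace `token_pattern` so tokens like ':water' also survive:

```python
from sklearn.feature_extraction.text import CountVectorizer

errList = ['Ragu ate lunch but didnt have Water for drinks',
           'Raghu ate lunch but didnt have water for drinks']

# lowercase=False keeps 'Water' and 'water' distinct;
# token_pattern=r'\S+' treats every whitespace-separated chunk as a token
cv = CountVectorizer(lowercase=False,
                     token_pattern=r'\S+',
                     ngram_range=(3, max(len(x.split()) for x in errList)))
counts = cv.fit_transform(errList)
```

The fitted vocabulary then contains 'didnt have Water' and 'didnt have water' as separate phrases.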
If you don't want to bother with external libraries, you can do this using just the stdlib (though it may be slower than some alternatives):
import collections
import itertools

def gen_ngrams(sentence):
    words = sentence.split()  # or re.findall(r'\b\w+\b'), or whatever
    n_words = len(words)
    for i in range(n_words - 2):
        for j in range(i + 3, n_words + 1):  # +1 so ngrams ending at the last word are included
            yield ' '.join(words[i:j])  # Assumes normalization of spaces

def count_ngrams(sentences):
    return collections.Counter(
        itertools.chain.from_iterable(
            gen_ngrams(sentence) for sentence in sentences
        )
    )

counts = count_ngrams(errList)
dict(counts.most_common(10))
Which gets you:
{'but didnt have': 11,
'ate lunch but': 7,
'ate lunch but didnt': 7,
'ate lunch but didnt have': 7,
'lunch but didnt': 7,
'lunch but didnt have': 7,
'icecream but didnt': 4,
'icecream but didnt have': 4,
'ate lunch and': 4,
'ate lunch and icecream': 4}
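Note that Counter here counts every occurrence of an n-gram, while the question asks for the number of sentences a phrase appears in; if a phrase can repeat within one sentence, you could deduplicate per sentence first. A sketch (with `gen_ngrams` repeated so the snippet stands alone):

```python
import collections

def gen_ngrams(sentence):
    # Yield every contiguous phrase of 3 or more words
    words = sentence.split()
    n_words = len(words)
    for i in range(n_words - 2):
        for j in range(i + 3, n_words + 1):
            yield ' '.join(words[i:j])

def count_sentences(sentences):
    # set() ensures each ngram counts at most once per sentence
    return collections.Counter(
        gram for sentence in sentences for gram in set(gen_ngrams(sentence))
    )

counts = count_sentences(['a b c a b c', 'a b c d'])
```

Here 'a b c' occurs twice in the first sentence but is counted once per sentence, giving a count of 2 rather than 3.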
Not the whole solution, but to help you along a little, the following will get you a dictionary of ngrams and counts. The next step, as Bill Bell points out in the comments, is to filter out the shorter subsequences. This (as also noted in the comments) would mean deciding on your maximum length, or indeed what defines a phrase...
from nltk import ngrams, word_tokenize
from collections import defaultdict

min_ngram_length = 1
# Longest possible ngram is the word count of the longest sentence
max_ngram_length = max(len(x.split()) for x in errList)

d = defaultdict(int)
for item in errList:
    for i in range(min_ngram_length, max_ngram_length + 1):
        for ngram in ngrams(word_tokenize(item), i):
            d[ngram] += 1

for pair in sorted(d.items(), key=lambda x: x[1], reverse=True):
    print(pair)
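The filtering of shorter subsequences mentioned above isn't spelled out anywhere in this thread; one possible sketch follows. The helper names (`is_subseq`, `maximal_phrases`) are mine, and the rule assumed here is: drop an n-gram if it sits inside a longer n-gram that has the same count.

```python
from collections import Counter

def is_subseq(short, long):
    """True if tuple `short` occurs as a contiguous run inside tuple `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))

def maximal_phrases(counts):
    """Drop any ngram contained in a longer ngram with the same count."""
    # Visit longest ngrams first so shorter ones can be checked against them
    items = sorted(counts.items(), key=lambda kv: len(kv[0]), reverse=True)
    kept = []
    for gram, c in items:
        if not any(c == kc and is_subseq(gram, kg) for kg, kc in kept):
            kept.append((gram, c))
    return dict(kept)

counts = Counter({('ate', 'lunch', 'but', 'didnt', 'have'): 7,
                  ('but', 'didnt', 'have'): 7,
                  ('water', 'for', 'drinks'): 7})
result = maximal_phrases(counts)
```

For the small example above, 'but didnt have' is dropped because it sits inside 'ate lunch but didnt have' with the same count, while 'water for drinks' survives.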
Using tools from the third-party library more_itertools:

Given
import itertools as it
import collections as ct
import more_itertools as mit
data = [
"Ragu ate lunch but didnt have Water for drinks",
"Rams ate lunch but didnt have Gatorade for drinks",
"Saya ate lunch but didnt have :water for drinks",
"Raghu ate lunch but didnt have water for drinks",
"Hanu ate lunch but didnt have -water for drinks",
"Wayu ate lunch but didnt have water for drinks",
"Viru ate lunch but didnt have .water 4or drinks",
"kk ate lunch & icecream but did have Water for drinks",
"M ate lunch &and icecream but did have Gatorade for drinks",
"Parker ate lunch icecream but didnt have :water for drinks",
"Sassy ate lunch and icecream but didnt have water for drinks",
"John ate lunch and icecream but didnt have -water for drinks",
"Pokey ate lunch and icecream but didnt have Water for drinks",
"Laila ate lunch and icecream but did have water 4or drinks",
]
Code
ngrams = []
for sentence in data:
    words = sentence.split()
    for n in range(3, len(words) + 1):
        ngrams.extend(mit.windowed(words, n))
counts = ct.Counter(ngrams)
dict(counts.most_common(5))
Yields
{('but', 'didnt', 'have'): 11,
('ate', 'lunch', 'but'): 7,
('lunch', 'but', 'didnt'): 7,
('ate', 'lunch', 'but', 'didnt'): 7,
('lunch', 'but', 'didnt', 'have'): 7}
Alternatively
sentences = [sentence.split() for sentence in data]
ngrams = mit.flatten(list(mit.windowed(w, n)) for w in sentences for n in range(3, len(w) + 1))
counts = ct.Counter(ngrams)
dict(counts.most_common(5))
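If you prefer string keys like the dictionary in the question rather than word tuples, one option is to join each window before counting (same windowing as above, assuming whitespace tokenization):

```python
import collections as ct
import more_itertools as mit

data = ["Ragu ate lunch but didnt have Water for drinks",
        "Rams ate lunch but didnt have Gatorade for drinks"]

ngrams = []
for sentence in data:
    words = sentence.split()
    for n in range(3, len(words) + 1):
        # Join each window of words into a plain string key
        ngrams.extend(' '.join(w) for w in mit.windowed(words, n))

counts = ct.Counter(ngrams)
```

For the two sentences above, counts['ate lunch but didnt have'] is 2.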