Generally, the head of a noun phrase is the rightmost noun of the NP. In the parse below, tree is the head of the parent NP:
                      ROOT
                       |
                       S
              _________|_________
              NP                |
       _______|_______          |
       |             PP         VP
       |          ___|___    ___|__
       NP         |     NP   |   PRT
 ______|______    |     |    |    |
DT  JJ  NN  NN   IN    NNP  VBD  RP
 |   |   |   |    |     |    |    |
The old oak tree from India fell down
Out[40]: Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['The']), Tree('JJ', ['old']), Tree('NN', ['oak']), Tree('NN', ['tree'])]), Tree('PP', [Tree('IN', ['from']), Tree('NP', [Tree('NNP', ['India'])])])]), Tree('VP', [Tree('VBD', ['fell']), Tree('PRT', [Tree('RP', ['down'])])])])
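The following code, based on a Java implementation, uses a simplistic rule to find the head of the NP, but I need it to follow the proper head-finding rules: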
from nltk.tree import Tree

parsestr = '(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'

def traverse(t):
    try:
        t.label()
    except AttributeError:
        # t is a leaf (a plain string), so there is nothing to descend into
        return
    if t.label() == 'NP':
        print('NP:' + str(t.leaves()))
        # naive rule: take the last leaf, whatever its tag is
        print('NPhead:' + str(t.leaves()[-1]))
    for child in t:
        traverse(child)

tree = Tree.fromstring(parsestr)
traverse(tree)
The above code gives the output:
NP:['The', 'old', 'oak', 'tree', 'from', 'India']
NPhead:India
NP:['The', 'old', 'oak', 'tree']
NPhead:tree
NP:['India']
NPhead:India
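The per-NP output looks plausible for this sentence (although the head of the outermost NP should be tree, not India), but I need to add a condition so that only the rightmost noun is extracted as the head. Currently this line does not check whether the word is a noun (NN) at all:
print('NPhead:' + str(t.leaves()[-1]))
So I would like something like the following for the NP head condition in the code above: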
t.leaves().getrightmostnoun()
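Lists have no `getrightmostnoun()` method, so the check has to be written by hand. A minimal sketch of one way to do it, assuming the input is the list of `(word, tag)` pairs that `Tree.pos()` returns for the NP; the names `NOUN_TAGS` and `rightmost_noun` are made up for this illustration and are not part of NLTK:

```python
# Hypothetical helper, not an NLTK function.
# Penn Treebank noun tags that count as "noun" for this rule.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def rightmost_noun(tagged_words):
    """Return the last word tagged as a noun, or None if there is none."""
    for word, tag in reversed(tagged_words):
        if tag in NOUN_TAGS:
            return word
    return None

# The (word, tag) pairs that np.pos() would give for the outermost NP
# "The old oak tree from India" in the example sentence.
tagged = [("The", "DT"), ("old", "JJ"), ("oak", "NN"),
          ("tree", "NN"), ("from", "IN"), ("India", "NNP")]

print(rightmost_noun(tagged))  # India
```

Applied to all leaves of the outermost NP this still returns India rather than tree, because the noun inside the PP is further right. Restricting the search to the NP's direct children avoids that particular problem.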
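Michael Collins's dissertation (Appendix A) includes head-finding rules for the Penn Treebank, so the head is not necessarily the rightmost noun. The condition above should therefore cover such cases as well.
Take the following example, given in one of the answers:
(NP (NP the person) that gave (NP the talk)) went home
The head noun of the subject is person, but the last leaf node of the NP the person that gave the talk is talk.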
NLTK has a built-in way to turn a bracketed string into a Tree object (http://www.nltk.org/_modules/nltk/tree.html), see https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L541.
>>> from nltk.tree import Tree
>>> parsestr='(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'
>>> for i in Tree.fromstring(parsestr).subtrees():
...     if i.label() == 'NP':
...         print(i)
...
(NP
  (NP (DT The) (JJ old) (NN oak) (NN tree))
  (PP (IN from) (NP (NNP India))))
(NP (DT The) (JJ old) (NN oak) (NN tree))
(NP (NNP India))
>>> for i in Tree.fromstring(parsestr).subtrees():
...     if i.label() == 'NP':
...         print(i.leaves())
...
['The', 'old', 'oak', 'tree', 'from', 'India']
['The', 'old', 'oak', 'tree']
['India']
Note that the rightmost noun is not always the head noun of an NP, e.g.
>>> s = '(ROOT (S (NP (NN Carnac) (DT the) (NN Magnificent)) (VP (VBD gave) (NP ((DT a) (NN talk))))))'
>>> Tree.fromstring(s)
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Carnac']), Tree('DT', ['the']), Tree('NN', ['Magnificent'])]), Tree('VP', [Tree('VBD', ['gave']), Tree('NP', [Tree('', [Tree('DT', ['a']), Tree('NN', ['talk'])])])])])])
>>> for i in Tree.fromstring(s).subtrees():
...     if i.label() == 'NP':
...         print(i.leaves()[-1])
...
Magnificent
talk
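Arguably, Magnificent can still be the head noun. Another example is when the NP contains a relative clause:
(NP (NP the person) that gave (NP the talk)) went home
The head noun of the subject is person, but the last leaf node of the NP the person that gave the talk is talk.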
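The Collins head rules mentioned in the question deal with the relative-clause case by looking only at the direct children of the NP. Below is a rough sketch of the NP rule as I recall it from Appendix A of the thesis, so check it against the original before relying on it. It takes the labels of the NP's direct children (with NLTK, `[child.label() for child in np]`) and returns the index of the head child. The function name `collins_np_head_index` is invented here. Two simplifications are assumptions of this sketch: the "last word is tagged POS" test is approximated by checking the last child's label, and the special handling of coordination is left out.

```python
def collins_np_head_index(child_labels):
    """Index of the head child of an NP, given its direct children's labels.

    Hypothetical helper sketching the NP head rule; not part of NLTK.
    """
    if not child_labels:
        raise ValueError("NP has no children")
    last = len(child_labels) - 1

    # A trailing possessive marker is taken as the head
    # (approximation: checks the last child, not the last word).
    if child_labels[last] == "POS":
        return last

    # Each rule: (scan from the right?, labels to look for), tried in order.
    rules = [
        (True, {"NN", "NNP", "NNPS", "NNS", "NX", "POS", "JJR"}),
        (False, {"NP"}),
        (True, {"$", "ADJP", "PRN"}),
        (True, {"CD"}),
        (True, {"JJ", "JJS", "RB", "QP"}),
    ]
    for from_right, labels in rules:
        indices = range(last, -1, -1) if from_right else range(last + 1)
        for i in indices:
            if child_labels[i] in labels:
                return i

    # Nothing matched: fall back to the last child.
    return last

# "the person that gave the talk": (NP (NP the person) (SBAR ...))
print(collins_np_head_index(["NP", "SBAR"]))            # 0
# "The old oak tree": (NP (DT The) (JJ old) (NN oak) (NN tree))
print(collins_np_head_index(["DT", "JJ", "NN", "NN"]))  # 3
```

When the head child is itself a phrase (the inner NP in the first call), the head word is found by applying the rule again to that child until a preterminal is reached.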
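I was looking for a Python script using NLTK that does this task and stumbled across this post. Here is the solution I came up with. It is a bit noisy and arbitrary, and it definitely does not always pick the right answer (e.g. for compound nouns). But I wanted to post it in case it helps others to have a solution that mostly works.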
#!/usr/bin/env python
from nltk.tree import Tree

examples = [
    '(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))',
    "(ROOT\n (S\n (NP\n (NP (DT the) (NN person))\n (SBAR\n (WHNP (WDT that))\n (S\n (VP (VBD gave)\n (NP (DT the) (NN talk))))))\n (VP (VBD went)\n (NP (NN home)))))",
    '(ROOT (S (NP (NN Carnac) (DT the) (NN Magnificent)) (VP (VBD gave) (NP ((DT a) (NN talk))))))'
]

def find_noun_phrases(tree):
    return list(tree.subtrees(lambda t: t.label() == 'NP'))

def find_head_of_np(np):
    noun_tags = ['NN', 'NNS', 'NNP', 'NNPS']
    top_level_trees = [child for child in np if isinstance(child, Tree)]
    ## search for a top-level noun
    top_level_nouns = [t for t in top_level_trees if t.label() in noun_tags]
    if len(top_level_nouns) > 0:
        ## if you find some, pick the rightmost one, just 'cause
        return top_level_nouns[-1][0]
    ## search for a top-level np
    top_level_nps = [t for t in top_level_trees if t.label() == 'NP']
    if len(top_level_nps) > 0:
        ## if you find some, pick the head of the rightmost one, just 'cause
        return find_head_of_np(top_level_nps[-1])
    ## search for any noun
    nouns = [p[0] for p in np.pos() if p[1] in noun_tags]
    if len(nouns) > 0:
        ## if you find some, pick the rightmost one, just 'cause
        return nouns[-1]
    ## return the rightmost word, just 'cause
    return np.leaves()[-1]

for example in examples:
    tree = Tree.fromstring(example)
    for np in find_noun_phrases(tree):
        print("noun phrase:", " ".join(np.leaves()))
        head = find_head_of_np(np)
        print("head:", head)
For the examples discussed in the question and in the other answers, this is the output:
noun phrase: The old oak tree from India
head: tree
noun phrase: The old oak tree
head: tree
noun phrase: India
head: India
noun phrase: the person that gave the talk
head: person
noun phrase: the person
head: person
noun phrase: the talk
head: talk
noun phrase: home
head: home
noun phrase: Carnac the Magnificent
head: Magnificent
noun phrase: a talk
head: talk