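Question: Iterating over the words of a file in Python

I need to iterate over the words of a large file that consists of a single, very long line. I know the ways to iterate over a file line by line, but they do not apply in my case because of the file's single-line structure.

Are there any alternatives?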


2017-10-12 19:12


Read the file a little at a time with a buffer... my_file.read(200) - JBernardo

Answers:

It really depends on your definition of a word. But try this:

f = open("your-filename-here").read()
for word in f.split():
    # do something with word
    print(word)

This uses whitespace characters as word boundaries.

Of course, remember to open and close the file properly; this is just a quick example.
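
As a minimal sketch of what "properly" could look like, assuming Python 3: a `with` block closes the file automatically. The function name `words_in_file` and its `path` parameter are illustrative and not part of the answer above. Like the quick example, it still reads the whole file into memory.

```python
def words_in_file(path):
    """Return every whitespace-separated word in the file at `path`."""
    # The with-block closes the file even if reading raises an exception.
    with open(path) as f:
        return f.read().split()
```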

2017-10-12 19:16

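A long line? I assume the line is too large to reasonably fit in memory, so you want some kind of buffering.

First of all, this is a bad format; if you have any control over the file, make it one word per line.

If not, use something like this: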

def read_words(input_file):
    # Generator (the function name is illustrative): yields space-separated words
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A word was found
            yield word
        else:
            # A word was not found; read a chunk of data from file
            next_chunk = input_file.read(1000)
            if next_chunk:
                # Add the chunk to our line
                line = word + next_chunk
            else:
                # No more data; yield the last word and return
                yield word.rstrip('\n')
                return

2017-10-12 19:25

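Bear in mind that this works fine when you want to write the words to a file one per line, but it does not work if you want to use the words directly, because it splits on spaces only. Given a chunk containing dog\ncat, it yields dog\ncat as a single word instead of dog and then cat. When dog\ncat is printed it looks right, but that is an illusion. - siulkilulki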
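
One way to address the newline issue raised in this comment is to split each chunk on any whitespace and carry an unfinished trailing word over to the next chunk. The following is only a sketch under assumptions: Python 3, a file opened in text mode, and the illustrative names `iter_words` and `chunk_size`, none of which come from the answer above.

```python
import io

def iter_words(input_file, chunk_size=1000):
    """Yield whitespace-separated words, reading chunk_size characters at a time."""
    tail = ''  # possibly unfinished word carried over from the previous chunk
    while True:
        chunk = input_file.read(chunk_size)
        if not chunk:
            break  # end of file
        text = tail + chunk
        words = text.split()  # splits on spaces, tabs and newlines alike
        if text[-1].isspace():
            tail = ''  # the chunk ended on a boundary; every word is complete
        else:
            tail = words.pop()  # the last word may continue in the next chunk
        for word in words:
            yield word
    if tail:
        yield tail  # the final word has no trailing whitespace

# Small chunk size to exercise words that straddle chunk boundaries
print(list(iter_words(io.StringIO("dog\ncat  bird"), chunk_size=4)))
# prints ['dog', 'cat', 'bird']
```

Only one chunk plus one partial word is held in memory at a time, so the length of the line does not matter.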

You really should consider using a generator:

def word_gen(file):
    for line in file:
        for word in line.split():
            yield word

with open('somefile') as f:
    # Iterate over the generator; calling word_gen(f) alone does nothing
    for word in word_gen(f):
        print(word)

2018-02-12 03:17

There are more efficient ways to do this, but syntactically this is probably the shortest:

words = open('myfile').read().split()

If memory is a concern, you won't want to do this, because it loads the entire file into memory rather than iterating over it.

2017-10-12 19:16

Read the line in normally, then split it on whitespace to break it into words?

Something like:

word_list = loaded_string.split()
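
If the string is already in memory but building a second, full list of words is undesirable, the words can be produced one at a time instead. This is a sketch rather than part of the answer above: it assumes Python 3, and the function name `iter_words_in_string` is illustrative. For ordinary text it should yield the same words as `split()`.

```python
import re

def iter_words_in_string(loaded_string):
    """Yield whitespace-separated words from loaded_string one at a time."""
    # \S+ matches each maximal run of non-whitespace characters
    for match in re.finditer(r'\S+', loaded_string):
        yield match.group()
```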

2017-10-12 19:15

After reading the line you can do this:

# `line` is the text that was read; `pattern` is the substring to look for
l = len(pattern)
i = 0
while True:
    i = line.find(pattern, i)
    if i == -1:
        break
    print(line[i:i+l])  # or do whatever
    i += l

Alex.

2017-10-12 19:23

Donald Miner's suggestion looks good. Simple and short. I used the following in some code I wrote a while ago:

l = []
f = open("filename.txt", "r")
for line in f:
    for word in line.split():
        l.append(word)

A longer version of what Donald Miner suggested.

2017-11-08 07:59

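I answered a similar question before, but I have since refined the method used in that answer. Here is the updated version (copied from a recent answer):

Here is my fully functional approach, which avoids having to read and split lines. It makes use of the itertools module:

Note: for Python 3, replace `itertools.imap` with `map`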

import itertools

def readwords(mfile):
    byte_stream = itertools.groupby(
      itertools.takewhile(lambda c: bool(c),
          itertools.imap(mfile.read,
              itertools.repeat(1))), str.isspace)

    return ("".join(group) for pred, group in byte_stream if not pred)

Sample usage:

>>> import sys
>>> for w in readwords(sys.stdin):
...     print (w)
... 
I really love this new method of reading words in python
I
really
love
this
new
method
of
reading
words
in
python

It's soo very Functional!
It's
soo
very
Functional!
>>>

I guess that in your case, this would be the way to use the function:

with open('words.txt', 'r') as f:
    for word in readwords(f):
        print(word)
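
For reference, here is how the function might look on Python 3 after applying the note above, with `itertools.imap` swapped for the built-in `map`. This is a sketch that assumes a file opened in text mode, since `str.isspace` expects `str` characters. The name `readwords_py3` is illustrative, and `bool` stands in for the original `lambda c: bool(c)`.

```python
import io
import itertools

def readwords_py3(mfile):
    """Lazily yield whitespace-separated words, reading one character at a time."""
    # map() is lazy in Python 3, so this calls mfile.read(1) on demand;
    # takewhile stops at the empty string that read() returns at end of file.
    chars = itertools.takewhile(bool, map(mfile.read, itertools.repeat(1)))
    # Group consecutive characters by whether they are whitespace
    groups = itertools.groupby(chars, str.isspace)
    return ("".join(group) for is_space, group in groups if not is_space)

print(list(readwords_py3(io.StringIO("I really love\nthis  method"))))
# prints ['I', 'really', 'love', 'this', 'method']
```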

2017-11-30 01:26
