Question: Identifying consecutive occurrences of a value

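I have a df like this (a single Count column, shown again in the desired output below):

I would like to return a 1 in a new column if there are two or more consecutive occurrences of 1 in Count, and a 0 if there are not. So each row in the new column gets a 1 or a 0 depending on whether that criterion is met in the Count column. My desired output is: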

Count  New_Value
1      0 
0      0
1      1
1      1
0      0
0      0
1      1
1      1 
1      1
0      0

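I think I may need to use itertools, but I have been reading about it and have not come across what I need. I would like to be able to use this method to count any number of consecutive occurrences, not just 2. For example, sometimes I need to count 10 consecutive occurrences; I just use 2 in the example here.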


2018-06-21 01:56


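Check whether df['Count'][1] == df['Count'][1].shift(1); if so, 1, otherwise 0. Then you should .append() these values (0 or 1) to an array. Then set the first element (array[0]) to 0 (the default). Then you have to figure out how to merge/join/plug/concatenate your array into your dataframe. 100% untested, but I think it might work... :) - dot.Py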

I may have oversimplified my question. What if I wanted 3 consecutive occurrences? I don't think this would work - Stefano Potter

Answers:

You could do:

df['consecutive'] = df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count

To get:

   Count  consecutive
0      1            1
1      0            0
2      1            2
3      1            2
4      0            0
5      0            0
6      1            3
7      1            3
8      1            3
9      0            0
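For readers who find the one-liner dense, here is a minimal sketch that splits it into its intermediate steps. The variable names changed, run_id and run_size are introduced only for illustration and are not part of the original answer; the data is the Count column from the question.

```python
import pandas as pd

# Sample data taken from the question's Count column.
df = pd.DataFrame({"Count": [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})

# True wherever a value differs from the one before it
# (the first row is always True because shift() puts NaN there).
changed = df["Count"] != df["Count"].shift()

# A running total of those changes gives every run its own integer label.
run_id = changed.cumsum()

# Broadcast the length of each run back onto the rows of that run.
run_size = df["Count"].groupby(run_id).transform("size")

# Multiplying by Count zeroes out the runs of 0s,
# leaving the run length only on rows where Count is 1.
consecutive = run_size * df["Count"]

print(pd.DataFrame({"Count": df["Count"], "run_id": run_id, "consecutive": consecutive}))
```

Note that the multiplication trick relies on Count holding only 0s and 1s; with other values it would scale the run length instead of masking it.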

From here, for any threshold, you can do:

threshold = 2
df['consecutive'] = (df.consecutive >= threshold).astype(int)

To get:

   Count  consecutive
0      1            0
1      0            0
2      1            1
3      1            1
4      0            0
5      0            0
6      1            1
7      1            1
8      1            1
9      0            0

Or, in a single step:

(df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
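Since the question asks for an arbitrary run length, the one-step expression could be wrapped in a small helper. This is only a sketch: the function name flag_runs and its value parameter are hypothetical, and it compares against value explicitly instead of multiplying by Count, so it should also behave sensibly when the column holds something other than 0s and 1s.

```python
import pandas as pd


def flag_runs(series: pd.Series, threshold: int, value=1) -> pd.Series:
    """Return 1 where `value` sits in a run of at least `threshold` rows, else 0.

    Hypothetical helper for illustration; not part of the original answer.
    """
    # Label each stretch of equal consecutive values with its own integer.
    run_id = (series != series.shift()).cumsum()
    # Length of the run each row belongs to.
    run_size = series.groupby(run_id).transform("size")
    # Flag rows that hold `value` and belong to a long enough run.
    return ((series == value) & (run_size >= threshold)).astype(int)


df = pd.DataFrame({"Count": [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})
df["New_Value"] = flag_runs(df["Count"], threshold=2)
print(df)
```

Calling flag_runs(df["Count"], threshold=3) on the same data would flag only the final run of three 1s.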

In terms of efficiency, the pandas approach gives a significant speedup as the size of the problem grows:

df = pd.concat([df for _ in range(1000)])

%timeit (df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
1000 loops, best of 3: 1.47 ms per loop

Compared to:

%%timeit
l = []
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)
    if k == 1 and size >= 2:
        l = l + [1]*size
    else:
        l = l + [0]*size    
pd.Series(l)

10 loops, best of 3: 76.7 ms per loop

2018-06-21 02:39

Here is a one-liner: df.assign(consecutive=df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size')).query('consecutive > @threshold') This will work for any consecutive values (not only 1 and 0) - MaxU
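To illustrate the point in the comment above that the run-length column works for arbitrary values, here is a rough sketch on made-up, non-binary data. Two choices here differ from the comment and are assumptions of this sketch: rows are selected with boolean indexing instead of .query, and the comparison is >= so that a run of exactly threshold rows is kept.

```python
import pandas as pd

# Hypothetical non-binary data, for illustration only.
df = pd.DataFrame({"Count": [5, 5, 7, 2, 2, 2, 5]})
threshold = 2

# Same run labelling as in the answer, without multiplying by Count.
run_id = (df["Count"] != df["Count"].shift()).cumsum()
run_size = df["Count"].groupby(run_id).transform("size")

# Keep only the rows that belong to a run of at least `threshold` equal values.
with_sizes = df.assign(consecutive=run_size)
long_runs = with_sizes[with_sizes["consecutive"] >= threshold]
print(long_runs)
```

Unlike the 0/1 flag column, this filters the frame, so the lone 7 and the trailing 5 are dropped from the result.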

Not sure if this is optimized, but you can give it a try:

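from itertools import groupby
import pandas as pd

l = []
# groupby yields one (value, run) pair per stretch of equal consecutive values
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)  # length of the current run
    if k == 1 and size >= 2:
        l.extend([1] * size)  # a run of 1s that is long enough
    else:
        l.extend([0] * size)  # a run of 0s, or a run of 1s that is too short

# pd.Series(l) has a default 0..n-1 index, so this assumes df has one too
df['new_Value'] = pd.Series(l)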

df

   Count  new_Value
0      1          0
1      0          0
2      1          1
3      1          1
4      0          0
5      0          0
6      1          1
7      1          1
8      1          1
9      0          0
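The loop above hard-codes a run length of 2 and a target value of 1. A possible generalization, sketched here in plain Python so it does not depend on pandas, takes both as parameters. The function name flag_runs_itertools and its parameters are hypothetical and not from the original answer.

```python
from itertools import groupby


def flag_runs_itertools(values, threshold=2, target=1):
    """Return a list with 1 for every element inside a run of `target`
    that is at least `threshold` long, and 0 everywhere else.

    Hypothetical helper for illustration only.
    """
    flags = []
    for key, group in groupby(values):
        size = sum(1 for _ in group)  # length of the current run
        flag = 1 if key == target and size >= threshold else 0
        flags.extend([flag] * size)
    return flags


counts = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]
new_value = flag_runs_itertools(counts, threshold=2)
print(new_value)
```

The returned list can be assigned straight to a DataFrame column of the same length, which avoids any index alignment concerns.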

2018-06-21 02:32
