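Question: How can I vectorize my loop with g++?

Introductory links I found while searching:

As you can see, most of them are for C, but I thought they would also work for C++. Here is my code: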

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

template<typename T>
//__attribute__((optimize("unroll-loops")))
//__attribute__ ((pure))
void foo(std::vector<T> &p1, size_t start,
            size_t end, const std::vector<T> &p2) {
  typename std::vector<T>::const_iterator it2 = p2.begin();
  //#pragma simd
  //#pragma omp parallel for
  //#pragma GCC ivdep Unroll Vector
  for (size_t i = start; i < end; ++i, ++it2) {
    p1[i] = p1[i] - *it2;
    p1[i] += 1;
  }
}

int main()
{
    size_t n;
    double x,y;
    n = 12800000;
    std::vector<double> v,u;
    for(size_t i=0; i<n; ++i) {
        x = i;
        y = i - 1;
        v.push_back(x);
        u.push_back(y);
    }
    using namespace std::chrono;

    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    foo(v,0,n,u);
    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;
    return 0;
}

I tried the hints shown commented out above, but I did not get any speedup, as the sample output shows (the first run is with #pragma GCC ivdep Unroll Vector uncommented):

samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -funroll-loops -ftree-vectorize -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.026575 seconds.
samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.0252697 seconds.

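Is there any hope? Or does the -O3 optimization flag already do the trick on its own? Any suggestions for speeding up this code (the foo function) are welcome!

My g++ version: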

samaras@samaras-A15:~/Downloads$ g++ --version
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1

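Note that the loop body is arbitrary. I am not interested in rewriting it in some other form.

EDIT

An answer saying that nothing more can be done is also acceptable!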


2018-03-27 03:29


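So did you look at the assembly to see whether the loop is already vectorized at -O3? - Mysticial

Oh damn, no, I didn't. I will do that by following this question: stackoverflow.com/questions/1289881/... Good idea, @Mysticial! - gsamaras

@Mysticial Maybe with the answer David gave, reading the assembly isn't needed? - gsamaras

I'm not sure the compiler is even allowed to vectorize that loop. How does it know that p1 and p2 don't alias? - Mysticial

By "don't alias", do you mean that they are definitely distinct? One of the links I posted describes an ivdep hint, but I'm not sure whether that answers your question, @Mysticial. - gsamaras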
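To illustrate the aliasing concern raised in the comments above, here is a minimal sketch. The names sub_plus_one, overlapping_result and disjoint_result are made up for this illustration and are not from the question. It applies the question's loop body once to two overlapping views of the same buffer and once to independent buffers holding the same values.

```cpp
#include <cstddef>
#include <vector>

// The loop body from the question, written on raw pointers.
// Nothing here promises that dst and src refer to different memory.
void sub_plus_one(double* dst, const double* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = dst[i] - src[i];
        dst[i] += 1;
    }
}

// Hypothetical helper: dst and src are views of the SAME buffer, shifted by one.
std::vector<double> overlapping_result() {
    std::vector<double> buf = {1, 2, 3, 4, 5};
    sub_plus_one(buf.data() + 1, buf.data(), 4);
    return buf;  // each iteration reads a value the previous iteration just wrote
}

// Hypothetical helper: same input values, but src is an independent copy.
std::vector<double> disjoint_result() {
    std::vector<double> buf = {1, 2, 3, 4, 5};
    const std::vector<double> copy = buf;
    sub_plus_one(buf.data() + 1, copy.data(), 4);
    return buf;
}
```

The overlapping call yields {1, 2, 2, 3, 3} while the disjoint call yields {1, 2, 2, 2, 2}. Loading several src elements before storing the dst elements, as a SIMD loop does, would change the overlapping result. The compiler therefore needs either a runtime overlap check or a promise that the arrays are distinct.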

Answers:

The -O3 flag turns on -ftree-vectorize automatically. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

-O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-loop-vectorize, -ftree-loop-distribute-patterns, -ftree-slp-vectorize, -fvect-cost-model, -ftree-partial-pre and -fipa-cp-clone options.

So in both cases the compiler is already attempting loop vectorization.

Compiling with g++ 4.8.2:

g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=1 -o test

gives this:

Analyzing loop at test.cpp:16


Vectorizing loop at test.cpp:16

test.cpp:16: note: create runtime check for data references *it2$_M_current_106 and *_39
test.cpp:16: note: created 1 versioning for alias checks.

test.cpp:16: note: LOOP VECTORIZED.
Analyzing loop at test_old.cpp:29

test.cpp:22: note: vectorized 1 loops in function.

test.cpp:18: note: Unroll loop 7 times

test.cpp:16: note: Unroll loop 7 times

test.cpp:28: note: Unroll loop 1 times

Compiling without the -ftree-vectorize flag:

g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorizer-verbose=1 -o test

returns only:

test_old.cpp:16: note: Unroll loop 7 times

test_old.cpp:28: note: Unroll loop 1 times

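Line 16 is the start of the loop in the function, so the compiler is definitely vectorizing it. The "runtime check" and "versioning for alias checks" notes mean it emitted both a vectorized and a scalar version of the loop and chooses between them at run time, depending on whether the two arrays overlap. Checking the assembly confirms the vectorization too.

I seem to be getting some aggressive caching on the laptop I'm currently using, which makes it very hard to measure accurately how long the function takes to run.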
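One common way to reduce this kind of measurement noise is to time the call several times and keep the fastest run. The following is only a sketch under that assumption: best_seconds is a hypothetical helper, not something from the question or from any library, and it uses std::chrono::steady_clock rather than the high_resolution_clock used in the question.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <limits>

// Hypothetical helper: runs `work` `repeats` times and returns the
// shortest wall-clock duration observed, in seconds.
template <typename F>
double best_seconds(F&& work, std::size_t repeats) {
    using clock = std::chrono::steady_clock;
    double best = std::numeric_limits<double>::infinity();
    for (std::size_t r = 0; r < repeats; ++r) {
        const auto t1 = clock::now();
        work();
        const auto t2 = clock::now();
        const double s = std::chrono::duration<double>(t2 - t1).count();
        best = std::min(best, s);
    }
    return best;
}
```

It could be called as best_seconds([&] { foo(v, 0, n, u); }, 10). Note that foo modifies its first argument in place, so later runs operate on already-updated values; for this arithmetic that should not change the timing, but it does change the final contents of v.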

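That said, here are a couple of other things you could try:

Use the __restrict__ qualifier to tell the compiler that there is no overlap between the arrays.
Tell the compiler that the arrays are aligned with __builtin_assume_aligned (not portable).

Here is my resulting code (I removed the template, since you would want a different alignment for different data types):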

#include <iostream>
#include <chrono>
#include <vector>

void foo( double * __restrict__ p1,
          double * __restrict__ p2,
          size_t start,
          size_t end )
{
  double* pA1 = static_cast<double*>(__builtin_assume_aligned(p1, 16));
  double* pA2 = static_cast<double*>(__builtin_assume_aligned(p2, 16));

  for (size_t i = start; i < end; ++i)
  {
      pA1[i] = pA1[i] - pA2[i];
      pA1[i] += 1;
  }
}

int main()
{
    size_t n;
    double x, y;
    n = 12800000;
    std::vector<double> v,u;

    for(size_t i=0; i<n; ++i) {
        x = i;
        y = i - 1;
        v.push_back(x);
        u.push_back(y);
    }

    using namespace std::chrono;

    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    foo(&v[0], &u[0], 0, n );
    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;

    return 0;
}

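Like I said, I had trouble getting consistent time measurements, so I can't confirm whether this will give you a performance increase (it might even be a decrease!).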

2018-03-27 03:47

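No difference! Maybe -funroll-loops is already enabled by -O2, but I can't confirm that. If you have any other suggestions, please use the edit button (recommended :D). - gsamaras

Yeah, I tried that too and it made no difference. Let me try a few things and see what I can find :) - David Saxon

If you don't come up with anything new, I can accept the answer, but you have to tell me! - gsamaras

Sorry, I ran out of time, but I'm still interested in this. I'll have more time to look at it tonight :) Don't accept the current answer yet, because I ran it with an old version of gcc without noticing. - David Saxon

OK, I'll wait for you, thanks! - gsamaras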

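GCC has compiler extensions that create new primitives which will use SIMD instructions. Take a look here for details.

Most compilers say they will auto-vectorize operations, but this depends on the compiler's pattern matching, and as you can imagine, that can be very hit and miss.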
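As a rough sketch of what those extensions look like for the loop in the question: the vector_size attribute takes a size in bytes, so 16 means a 128-bit vector holding two doubles. The names v2df and foo_simd are chosen here for illustration only, and the sketch assumes GCC (or a compiler that accepts the same extension). It makes no claim of being faster than what the auto-vectorizer already produces.

```cpp
#include <cstddef>
#include <cstring>

// GCC vector extension: a 16-byte (128-bit) vector of two doubles.
typedef double v2df __attribute__((vector_size(16)));

// Hypothetical SIMD version of the question's loop on raw pointers.
void foo_simd(double* p1, const double* p2, std::size_t n) {
    const v2df one = {1.0, 1.0};
    std::size_t i = 0;
    // Main loop: process two doubles per iteration.
    for (; i + 2 <= n; i += 2) {
        v2df a, b;
        std::memcpy(&a, p1 + i, sizeof a);  // load without assuming alignment
        std::memcpy(&b, p2 + i, sizeof b);
        a = a - b + one;                    // element-wise (a - b) + 1
        std::memcpy(p1 + i, &a, sizeof a);  // store the two results
    }
    // Scalar tail for an odd element count.
    for (; i < n; ++i) {
        p1[i] = p1[i] - p2[i];
        p1[i] += 1;
    }
}
```

With the vectors from the question this could be called as foo_simd(&v[0], &u[0], n). The memcpy calls keep the loads and stores valid for unaligned data; GCC typically compiles them to plain vector moves.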

2018-03-30 12:39

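Interesting, but I'm not sure what size I should pass in the attribute. Can you walk me through it? - gsamaras

I think many architectures have 128-bit SIMD registers, so keep all your objects 128 bits wide. Also remember that SIMD does not speed up data load and store times; it only speeds up the arithmetic. - doron

It's still not clear to me what I should do. Am I limited to 4 dimensions? An example would help. - gsamaras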
