Bug in SLP vectorizer

Hi,

I don't know where to properly report bugs in the Apple version of LLVM, so let me try here. The following piece of code produces erroneous assembly code when compiled with AVX (-mavx) using clang/llvm 7.0.0:

#include <iostream>
#include <immintrin.h>


typedef double Matrix2d[4];


__attribute__((noinline))
void print(const Matrix2d &A)
{
  std::cout << A[0] << " " << A[2] << "\n" << A[1] << " " << A[3] << "\n\n";
}


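// Horizontal sum of the four entries of mat using AVX.
double sum(const Matrix2d &mat)
{
  __m256d tmp0 = _mm256_load_pd(mat);
  // Swap the two 128-bit halves, then add adjacent pairs.
  __m256d tmp1 = _mm256_hadd_pd(tmp0,_mm256_permute2f128_pd(tmp0,tmp0,1));
  // A second hadd leaves the total in the low element.
  return _mm_cvtsd_f64(_mm256_castpd256_pd128(_mm256_hadd_pd(tmp1,tmp1)));
}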


void foo(const Matrix2d &A)
{
  Matrix2d B;
  B[0] = A[0];
  B[1] = A[1];
  B[2] = 0;
  B[3] = A[3];
  double scale = sum(B);
  B[0] /= scale;
  B[1] /= scale;
  B[3] /= scale;


  print(B);
}


int main()
{
  Matrix2d A = {1, 0, 0, 1};
  foo(A);
  return 0;
}

Output:

0.5 0
0.5 0.5

Expected output, obtained by compiling with -fno-slp-vectorize:

0.5 0
0 0.5
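
As a cross-check that does not depend on AVX or on the vectorizer flags, here is a scalar sketch of the same computation. The names scalar_sum, scalar_foo and format are not part of the test case above; they are hypothetical helpers introduced only for this illustration, and scalar_sum assumes that sum() is meant to add all four entries.

```cpp
#include <cassert>
#include <sstream>
#include <string>

typedef double Matrix2d[4];

// Hypothetical scalar stand-in for sum(): adds the four entries without intrinsics.
double scalar_sum(const Matrix2d &mat)
{
  return mat[0] + mat[1] + mat[2] + mat[3];
}

// Same steps as foo(), but writes the result into B instead of printing it.
void scalar_foo(const Matrix2d &A, Matrix2d &B)
{
  B[0] = A[0];
  B[1] = A[1];
  B[2] = 0;
  B[3] = A[3];
  double scale = scalar_sum(B);
  B[0] /= scale;
  B[1] /= scale;
  B[3] /= scale;
}

// Same layout as print(), returned as a string so it can be compared.
std::string format(const Matrix2d &A)
{
  std::ostringstream os;
  os << A[0] << " " << A[2] << "\n" << A[1] << " " << A[3] << "\n\n";
  return os.str();
}
```

With A = {1, 0, 0, 1}, scale is 2 and B[1] stays 0, so the second row should read "0 0.5" as in the expected output above.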

I cannot reproduce the issue with clang/llvm 3.8 from llvm.org.