Convert 0x1234 to 0x11223344

QUESTION

Friday, 14 February 2014

How do I expand the hexadecimal number 0x1234 to 0x11223344 in a high-performance way?

unsigned int c = 0x1234, b;
b = (c & 0xff) << 4 | c & 0xf | (c & 0xff0) << 8
        | (c & 0xff00) << 12 | (c & 0xf000) << 16;
printf("%#x -> %#x\n", c, b); // %#x prints an unsigned value with a 0x prefix; %p is only valid for pointers

Output:

0x1234 -> 0x11223344

I need this for color conversion. Users provide their data in the form 0xARGB, and I need to convert it to 0xAARRGGBB. And yes, there could be millions of values, because each one could be a pixel. 1000x1000 pixels is one million.

The actual case is even more complicated, because a single 32-bit value contains both the foreground and the background color. So 0xARGBargb becomes: [ 0xAARRGGBB, 0xaarrggbb ]

Oh yes, one more thing: in the real application I also negate alpha, because in OpenGL 0xFF is non-transparent and 0x00 is most transparent, which is inconvenient in most cases, because usually you only need the RGB part and transparency is assumed to be absent.
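The full requirement described above (split 0xARGBargb into two colors, expand each, flip alpha) could be sketched in scalar code as follows. This is only an illustration: the names `expand4444`, `ColorPair` and `expandPair` are made up here, "negate alpha" is assumed to mean 0xFF minus alpha, and the foreground color is assumed to sit in the upper 16 bits.

```cpp
#include <cstdint>

// Expand one 16-bit 0xARGB value to 0xAARRGGBB by doubling each nibble.
// Arithmetically equivalent to the shift-and-mask expression in the question.
inline std::uint32_t expand4444(std::uint32_t c) {
  c &= 0xFFFFu;
  std::uint32_t spread = (c & 0x000Fu)
                       | (c & 0x00F0u) << 4
                       | (c & 0x0F00u) << 8
                       | (c & 0xF000u) << 12; // 0x1234 -> 0x01020304
  return spread * 0x11u;                      // 0x01020304 -> 0x11223344
}

// Hypothetical result type for one packed 0xARGBargb value.
struct ColorPair {
  std::uint32_t foreground; // assumed to come from the upper 16 bits
  std::uint32_t background; // assumed to come from the lower 16 bits
};

// Hypothetical helper: expand both halves and invert only the alpha byte.
inline ColorPair expandPair(std::uint32_t packed) {
  const std::uint32_t alphaFlip = 0xFF000000u; // XOR turns AA into 0xFF - AA
  return { expand4444(packed >> 16) ^ alphaFlip,
           expand4444(packed & 0xFFFFu) ^ alphaFlip };
}
```

With these assumptions, `expandPair(0x1234ABCD)` yields 0xEE223344 and 0x55BBCCDD. The multiply by 0x11 cannot overflow because the spread value is at most 0x0F0F0F0F.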

ANSWER

Friday, 14 February 2014

This operation can be done using SSE2 as follows:

void ExpandSSE2(unsigned __int64 in, unsigned __int64 &outLo, unsigned __int64 &outHi) {
  __m128i const mask = _mm_set1_epi16((short)0xF00F);
  __m128i const mul0 = _mm_set1_epi16(0x0011);
  __m128i const mul1 = _mm_set1_epi16(0x1000);
  __m128i       v;

  v = _mm_cvtsi64_si128(in);    // move the 64-bit value to a 128-bit register
  v = _mm_unpacklo_epi8(v, v);  // 0x12   -> 0x1212
  v = _mm_and_si128(v, mask);   // 0x1212 -> 0x1002
  v = _mm_mullo_epi16(v, mul0); // 0x1002 -> 0x1022
  v = _mm_mulhi_epu16(v, mul1); // 0x1022 -> 0x0102
  v = _mm_mullo_epi16(v, mul0); // 0x0102 -> 0x1122

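  outLo = _mm_cvtsi128_si64(v);                    // low 64 bits (SSE2; _mm_extract_epi64 would require SSE4.1)
  outHi = _mm_cvtsi128_si64(_mm_srli_si128(v, 8)); // high 64 bits, shifted down by 8 bytes first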
}

Of course you'd want to put the guts of the function in an inner loop and pull out the constants. You will also want to skip the x64 registers and load values directly into 128-bit SSE registers. For an example of how to do this, refer to the SSE2 implementation in the performance test below.

At its core there are five instructions, which perform the operation on four color values at once. So that is only about 1.25 instructions per color value. It should also be noted that SSE2 is available anywhere x64 is available.
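To make the five steps easier to follow, here is a scalar model of what happens inside a single 16-bit lane. It is a sketch for illustration only, not part of the answer's code; the function name `expandLane` is made up, and the masks with 0xFFFF imitate the 16-bit truncation that `_mm_mullo_epi16` performs.

```cpp
#include <cstdint>

// Scalar model of one 16-bit lane of the SSE2 routine: takes one input byte
// (two nibbles) and returns the expanded 16-bit value.
inline std::uint16_t expandLane(std::uint8_t byte) {
  std::uint32_t v = byte * 0x0101u;  // like _mm_unpacklo_epi8(v, v): 0x12   -> 0x1212
  v &= 0xF00Fu;                      // like _mm_and_si128:           0x1212 -> 0x1002
  v = (v * 0x0011u) & 0xFFFFu;       // like _mm_mullo_epi16:         0x1002 -> 0x1022
  v = (v * 0x1000u) >> 16;           // like _mm_mulhi_epu16:         0x1022 -> 0x0102
  v = (v * 0x0011u) & 0xFFFFu;       // like _mm_mullo_epi16:         0x0102 -> 0x1122
  return static_cast<std::uint16_t>(v);
}
```

The multiply-high by 0x1000 is just a right shift by four bits, which is why the benchmark comment notes that a shift could be used instead.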

Performance tests for an assortment of the solutions here
A few people have mentioned that the only way to know what is faster is to run the code, and this is unarguably true. So I have compiled a few of the solutions into a performance test so that we can compare apples to apples. I chose solutions that I felt were different enough from the others to require testing. All the solutions read from memory, operate on the data, and write back to memory. In practice, some of the SSE solutions will require additional care around alignment and around handling the case where there is not another full 16 bytes of input data left to process. The code I tested was compiled for x64 in release mode using Visual Studio 2013, running on a 4 GHz Core i7.

Here are my results:

ExpandOrig:               56.234 seconds  // from asker's original question
ExpandSmallLUT:           30.209 seconds  // from Dmitry's answer
ExpandLookupSmallOneLUT:  33.689 seconds  // from Dmitry's answer
ExpandLookupLarge:        51.312 seconds  // a straightforward lookup table
ExpandAShelly:            43.829 seconds  // from AShelly's answer
ExpandAShellyMulOp:       43.580 seconds  // AShelly's answer with an optimization
ExpandSSE4:               17.854 seconds  // my original SSE4 answer
ExpandSSE4Unroll:         17.405 seconds  // my original SSE4 answer with loop unrolling
ExpandSSE2:               17.281 seconds  // my current SSE2 answer
ExpandSSE2Unroll:         17.152 seconds  // my current SSE2 answer with loop unrolling

You will see in the test results above that I included the asker's code and three lookup table implementations, including the small lookup table implementation proposed in Dmitry's answer. AShelly's solution is included too, as well as a version with an optimization I made (one operation can be eliminated). I included my original SSE4 implementation, as well as a superior SSE2 version I made later (now reflected as the answer), along with unrolled versions of both, since they were the fastest here and I wanted to see how much unrolling sped them up. I also included an SSE4 implementation of AShelly's answer.
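The eliminated operation is the final add: computing `w += w * 0x10` is the same as `w *= 0x11`. A minimal check, assuming the function names below (they are invented for this sketch), compares the multiply form against the asker's shift-and-mask expression over every 16-bit input.

```cpp
#include <cstdint>

// The asker's original shift-and-mask expression.
inline std::uint32_t ExpandShifts(std::uint32_t c) {
  return (c & 0x00FFu) << 4
       | (c & 0x000Fu)
       | (c & 0x0FF0u) << 8
       | (c & 0xFF00u) << 12
       | (c & 0xF000u) << 16;
}

// AShelly's magic-number form with the final add folded into one multiply.
inline std::uint32_t ExpandMul(std::uint32_t c) {
  std::uint32_t w = (((c & 0x0F0Fu) * 0x0101u) & 0x000F000Fu)
                  + (((c & 0xF0F0u) * 0x1010u) & 0x0F000F00u); // 0x1234 -> 0x01020304
  return w * 0x11u;                                            // same as w + w * 0x10
}

// Hypothetical checker: number of 16-bit inputs where the two forms disagree.
inline unsigned CountMismatches() {
  unsigned mismatches = 0;
  for (std::uint32_t c = 0; c <= 0xFFFFu; ++c)
    if (ExpandShifts(c) != ExpandMul(c))
      ++mismatches;
  return mismatches;
}
```

A check like this only shows that the two forms agree on results; which one is faster still has to be measured.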

So far I have to declare myself the winner. But the source is below, so anyone can test it on their own platform and add their own solution to the test to see whether they have made something even faster.

#define DATA_SIZE_IN  ((unsigned)(1024 * 1024 * 128))
#define DATA_SIZE_OUT ((unsigned)(2 * DATA_SIZE_IN))
#define RERUN_COUNT   500
#include <cstdlib>
#include <cstring>     // memcmp
#include <ctime>
#include <iostream>
#include <utility>
#include <emmintrin.h> // SSE2
#include <tmmintrin.h> // SSSE3
#include <smmintrin.h> // SSE4

void ExpandOrig(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  unsigned u, v;
  do {
    // read in data
    u  = *(unsigned const*)in;
    v  = u >> 16;
    u &= 0x0000FFFF;

    // do computation
    u  =   (u & 0x00FF) << 4
         | (u & 0x000F)
         | (u & 0x0FF0) << 8
         | (u & 0xFF00) << 12
         | (u & 0xF000) << 16;
    v  =   (v & 0x00FF) << 4
         | (v & 0x000F)
         | (v & 0x0FF0) << 8
         | (v & 0xFF00) << 12
         | (v & 0xF000) << 16;

    // store data
    *(unsigned*)(out)      = u;
    *(unsigned*)(out + 4)  = v;
    in                    += 4;
    out                   += 8;
  } while (in != past);
}

unsigned LutLo[256],
         LutHi[256];
void MakeLutLo(void) {
  for (unsigned i = 0, x; i < 256; ++i) {
    x        = i;
    x        = ((x & 0xF0) << 4) | (x & 0x0F);
    x       |= (x << 4);
    LutLo[i] = x;
  }
}
void MakeLutHi(void) {
  for (unsigned i = 0, x; i < 256; ++i) {
    x        = i;
    x        = ((x & 0xF0) << 20) | ((x & 0x0F) << 16);
    x       |= (x << 4);
    LutHi[i] = x;
  }
}

void ExpandLookupSmall(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  unsigned u, v;
  do {
    // read in data
    u  = *(unsigned const*)in;
    v  = u >> 16;
    u &= 0x0000FFFF;

    // do computation
    u = LutHi[u >> 8] | LutLo[u & 0xFF];
    v = LutHi[v >> 8] | LutLo[v & 0xFF];

    // store data
    *(unsigned*)(out)      = u;
    *(unsigned*)(out + 4)  = v;
    in                    += 4;
    out                   += 8;
  } while (in != past);
}

void ExpandLookupSmallOneLUT(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  unsigned u, v;
  do {
    // read in data
    u = *(unsigned const*)in;
    v = u >> 16;
    u &= 0x0000FFFF;

    // do computation
    u = ((LutLo[u >> 8] << 16) | LutLo[u & 0xFF]);
    v = ((LutLo[v >> 8] << 16) | LutLo[v & 0xFF]);

    // store data
    *(unsigned*)(out) = u;
    *(unsigned*)(out + 4) = v;
    in  += 4;
    out += 8;
  } while (in != past);
}

unsigned LutLarge[256 * 256];
void MakeLutLarge(void) {
  for (unsigned i = 0; i < (256 * 256); ++i)
    LutLarge[i] = LutHi[i >> 8] | LutLo[i & 0xFF];
}

void ExpandLookupLarge(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  unsigned u, v;
  do {
    // read in data
    u  = *(unsigned const*)in;
    v  = u >> 16;
    u &= 0x0000FFFF;

    // do computation
    u = LutLarge[u];
    v = LutLarge[v];

    // store data
    *(unsigned*)(out)      = u;
    *(unsigned*)(out + 4)  = v;
    in                    += 4;
    out                   += 8;
  } while (in != past);
}

void ExpandAShelly(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  unsigned u, v, w, x;
  do {
    // read in data
    u  = *(unsigned const*)in;
    v  = u >> 16;
    u &= 0x0000FFFF;

    // do computation
    w  = (((u & 0xF0F) * 0x101) & 0xF000F) + (((u & 0xF0F0) * 0x1010) & 0xF000F00);
    x  = (((v & 0xF0F) * 0x101) & 0xF000F) + (((v & 0xF0F0) * 0x1010) & 0xF000F00);
    w += w * 0x10;
    x += x * 0x10;

    // store data
    *(unsigned*)(out)      = w;
    *(unsigned*)(out + 4)  = x;
    in                    += 4;
    out                   += 8;
  } while (in != past);
}

void ExpandAShellyMulOp(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  unsigned u, v;
  do {
    // read in data
    u = *(unsigned const*)in;
    v = u >> 16;
    u &= 0x0000FFFF;

    // do computation
    u = ((((u & 0xF0F) * 0x101) & 0xF000F) + (((u & 0xF0F0) * 0x1010) & 0xF000F00)) * 0x11;
    v = ((((v & 0xF0F) * 0x101) & 0xF000F) + (((v & 0xF0F0) * 0x1010) & 0xF000F00)) * 0x11;

    // store data
    *(unsigned*)(out) = u;
    *(unsigned*)(out + 4) = v;
    in  += 4;
    out += 8;
  } while (in != past);
}

void ExpandSSE4(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  __m128i const mask0 = _mm_set1_epi16((short)0x8000),
                mask1 = _mm_set1_epi8(0x0F),
                mul = _mm_set1_epi16(0x0011);
  __m128i       u, v, w, x;
  do {
    // read 16 input bytes into u
    u = _mm_load_si128((__m128i const*)in);

    v = _mm_unpackhi_epi8(u, u);      // expand each byte of the high half to two bytes
    u = _mm_unpacklo_epi8(u, u);      // do the same for the low half
    w = _mm_srli_epi16(u, 4);         // copy the value into w and shift it right half a byte
    x = _mm_srli_epi16(v, 4);         // do it again for v
    u = _mm_blendv_epi8(u, w, mask0); // select odd bytes from w and even bytes from u, giving the desired value in the lower nibble of each byte
    v = _mm_blendv_epi8(v, x, mask0); // do it again for v
    u = _mm_and_si128(u, mask1);      // clear all the upper nibbles
    v = _mm_and_si128(v, mask1);      // do it again for v
    u = _mm_mullo_epi16(u, mul);      // multiply each 16-bit value by 0x0011 to duplicate the lower nibble in the upper nibble of each byte
    v = _mm_mullo_epi16(v, mul);      // do it again for v

    // write output
    _mm_store_si128((__m128i*)(out     ), u);
    _mm_store_si128((__m128i*)(out + 16), v);
    in  += 16;
    out += 32;
  } while (in != past);
}

void ExpandSSE4Unroll(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  __m128i const mask0  = _mm_set1_epi16((short)0x8000),
                mask1  = _mm_set1_epi8(0x0F),
                mul    = _mm_set1_epi16(0x0011);
  __m128i       u0, v0, w0, x0,
                u1, v1, w1, x1,
                u2, v2, w2, x2,
                u3, v3, w3, x3;
  do {
    // read 64 input bytes into u0..u3
    u0 = _mm_load_si128((__m128i const*)(in     ));
    u1 = _mm_load_si128((__m128i const*)(in + 16));
    u2 = _mm_load_si128((__m128i const*)(in + 32));
    u3 = _mm_load_si128((__m128i const*)(in + 48));

    v0 = _mm_unpackhi_epi8(u0, u0);      // expand each byte of the high half to two bytes
    u0 = _mm_unpacklo_epi8(u0, u0);      // do the same for the low half
    v1 = _mm_unpackhi_epi8(u1, u1);      // again for v1
    u1 = _mm_unpacklo_epi8(u1, u1);      // again for u1
    v2 = _mm_unpackhi_epi8(u2, u2);      // again for v2
    u2 = _mm_unpacklo_epi8(u2, u2);      // again for u2
    v3 = _mm_unpackhi_epi8(u3, u3);      // again for v3
    u3 = _mm_unpacklo_epi8(u3, u3);      // again for u3
    w0 = _mm_srli_epi16(u0, 4);          // copy the value into w and shift it right half a byte
    x0 = _mm_srli_epi16(v0, 4);          // do it again for v
    w1 = _mm_srli_epi16(u1, 4);          // again for u1
    x1 = _mm_srli_epi16(v1, 4);          // again for v1
    w2 = _mm_srli_epi16(u2, 4);          // again for u2
    x2 = _mm_srli_epi16(v2, 4);          // again for v2
    w3 = _mm_srli_epi16(u3, 4);          // again for u3
    x3 = _mm_srli_epi16(v3, 4);          // again for v3
    u0 = _mm_blendv_epi8(u0, w0, mask0); // select odd bytes from w and even bytes from u, giving the desired value in the lower nibble of each byte
    v0 = _mm_blendv_epi8(v0, x0, mask0); // do it again for v
    u1 = _mm_blendv_epi8(u1, w1, mask0); // again for u1
    v1 = _mm_blendv_epi8(v1, x1, mask0); // again for v1
    u2 = _mm_blendv_epi8(u2, w2, mask0); // again for u2
    v2 = _mm_blendv_epi8(v2, x2, mask0); // again for v2
    u3 = _mm_blendv_epi8(u3, w3, mask0); // again for u3
    v3 = _mm_blendv_epi8(v3, x3, mask0); // again for v3
    u0 = _mm_and_si128(u0, mask1);       // clear all the upper nibbles
    v0 = _mm_and_si128(v0, mask1);       // do it again for v
    u1 = _mm_and_si128(u1, mask1);       // again for u1
    v1 = _mm_and_si128(v1, mask1);       // again for v1
    u2 = _mm_and_si128(u2, mask1);       // again for u2
    v2 = _mm_and_si128(v2, mask1);       // again for v2
    u3 = _mm_and_si128(u3, mask1);       // again for u3
    v3 = _mm_and_si128(v3, mask1);       // again for v3
    u0 = _mm_mullo_epi16(u0, mul);       // multiply each 16-bit value by 0x0011 to duplicate the lower nibble in the upper nibble of each byte
    v0 = _mm_mullo_epi16(v0, mul);       // do it again for v
    u1 = _mm_mullo_epi16(u1, mul);       // again for u1
    v1 = _mm_mullo_epi16(v1, mul);       // again for v1
    u2 = _mm_mullo_epi16(u2, mul);       // again for u2
    v2 = _mm_mullo_epi16(v2, mul);       // again for v2
    u3 = _mm_mullo_epi16(u3, mul);       // again for u3
    v3 = _mm_mullo_epi16(v3, mul);       // again for v3

    // write output
    _mm_store_si128((__m128i*)(out      ), u0);
    _mm_store_si128((__m128i*)(out +  16), v0);
    _mm_store_si128((__m128i*)(out +  32), u1);
    _mm_store_si128((__m128i*)(out +  48), v1);
    _mm_store_si128((__m128i*)(out +  64), u2);
    _mm_store_si128((__m128i*)(out +  80), v2);
    _mm_store_si128((__m128i*)(out +  96), u3);
    _mm_store_si128((__m128i*)(out + 112), v3);
    in  += 64;
    out += 128;
  } while (in != past);
}

void ExpandSSE2(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  __m128i const mask = _mm_set1_epi16((short)0xF00F),
                mul0 = _mm_set1_epi16(0x0011),
                mul1 = _mm_set1_epi16(0x1000);
  __m128i       u, v;
  do {
    // read 16 input bytes into u
    u = _mm_load_si128((__m128i const*)in);

    v = _mm_unpackhi_epi8(u, u);      // expand each byte of the high half to two bytes
    u = _mm_unpacklo_epi8(u, u);      // do the same for the low half

    u = _mm_and_si128(u, mask);
    v = _mm_and_si128(v, mask);
    u = _mm_mullo_epi16(u, mul0);
    v = _mm_mullo_epi16(v, mul0);
    u = _mm_mulhi_epu16(u, mul1);     // this can also be done with a right shift of 4 bits, but this seems to measure faster
    v = _mm_mulhi_epu16(v, mul1);
    u = _mm_mullo_epi16(u, mul0);
    v = _mm_mullo_epi16(v, mul0);

    // write output
    _mm_store_si128((__m128i*)(out     ), u);
    _mm_store_si128((__m128i*)(out + 16), v);
    in  += 16;
    out += 32;
  } while (in != past);
}

void ExpandSSE2Unroll(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  __m128i const mask = _mm_set1_epi16((short)0xF00F),
                mul0 = _mm_set1_epi16(0x0011),
                mul1 = _mm_set1_epi16(0x1000);
  __m128i       u0, v0,
                u1, v1;
  do {
    // read 32 input bytes into u0 and u1
    u0 = _mm_load_si128((__m128i const*)(in     ));
    u1 = _mm_load_si128((__m128i const*)(in + 16));

    v0 = _mm_unpackhi_epi8(u0, u0);      // expand each byte of the high half to two bytes
    u0 = _mm_unpacklo_epi8(u0, u0);      // do the same for the low half
    v1 = _mm_unpackhi_epi8(u1, u1);      // again for v1
    u1 = _mm_unpacklo_epi8(u1, u1);      // again for u1

    u0 = _mm_and_si128(u0, mask);
    v0 = _mm_and_si128(v0, mask);
    u1 = _mm_and_si128(u1, mask);
    v1 = _mm_and_si128(v1, mask);

    u0 = _mm_mullo_epi16(u0, mul0);
    v0 = _mm_mullo_epi16(v0, mul0);
    u1 = _mm_mullo_epi16(u1, mul0);
    v1 = _mm_mullo_epi16(v1, mul0);

    u0 = _mm_mulhi_epu16(u0, mul1);
    v0 = _mm_mulhi_epu16(v0, mul1);
    u1 = _mm_mulhi_epu16(u1, mul1);
    v1 = _mm_mulhi_epu16(v1, mul1);

    u0 = _mm_mullo_epi16(u0, mul0);
    v0 = _mm_mullo_epi16(v0, mul0);
    u1 = _mm_mullo_epi16(u1, mul0);
    v1 = _mm_mullo_epi16(v1, mul0);

    // write output
    _mm_store_si128((__m128i*)(out     ), u0);
    _mm_store_si128((__m128i*)(out + 16), v0);
    _mm_store_si128((__m128i*)(out + 32), u1);
    _mm_store_si128((__m128i*)(out + 48), v1);

    in  += 32;
    out += 64;
  } while (in != past);
}

void ExpandAShellySSE4(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  __m128i const zero      = _mm_setzero_si128(),
                v0F0F     = _mm_set1_epi32(0x0F0F),
                vF0F0     = _mm_set1_epi32(0xF0F0),
                v0101     = _mm_set1_epi32(0x0101),
                v1010     = _mm_set1_epi32(0x1010),
                v000F000F = _mm_set1_epi32(0x000F000F),
                v0F000F00 = _mm_set1_epi32(0x0F000F00),
                v0011 = _mm_set1_epi32(0x0011);
  __m128i       u, v, w, x;
  do {
    // read in data
    u = _mm_load_si128((__m128i const*)in);
    v = _mm_unpackhi_epi16(u, zero);
    u = _mm_unpacklo_epi16(u, zero);

    // original source: ((((a & 0xF0F) * 0x101) & 0xF000F) + (((a & 0xF0F0) * 0x1010) & 0xF000F00)) * 0x11;
    w = _mm_and_si128(u, v0F0F);
    x = _mm_and_si128(v, v0F0F);
    u = _mm_and_si128(u, vF0F0);
    v = _mm_and_si128(v, vF0F0);
    w = _mm_mullo_epi32(w, v0101); // _mm_mullo_epi32 is what makes this require SSE4 instead of SSE2
    x = _mm_mullo_epi32(x, v0101);
    u = _mm_mullo_epi32(u, v1010);
    v = _mm_mullo_epi32(v, v1010);
    w = _mm_and_si128(w, v000F000F);
    x = _mm_and_si128(x, v000F000F);
    u = _mm_and_si128(u, v0F000F00);
    v = _mm_and_si128(v, v0F000F00);
    u = _mm_add_epi32(u, w);
    v = _mm_add_epi32(v, x);
    u = _mm_mullo_epi32(u, v0011);
    v = _mm_mullo_epi32(v, v0011);

    // write output
    _mm_store_si128((__m128i*)(out     ), u);
    _mm_store_si128((__m128i*)(out + 16), v);
    in  += 16;
    out += 32;
  } while (in != past);
}

int main() {
  unsigned char *const indat   = new unsigned char[DATA_SIZE_IN ],
                *const outdat0 = new unsigned char[DATA_SIZE_OUT],
                *const outdat1 = new unsigned char[DATA_SIZE_OUT],
                *      curout  = outdat0,
                *      lastout = outdat1,
                *      place;
  unsigned             start,
                       stop;

  place = indat + DATA_SIZE_IN - 1;
  do {
    *place = (unsigned char)rand();
  } while (place-- != indat);
  MakeLutLo();
  MakeLutHi();
  MakeLutLarge();

  for (unsigned testcount = 0; testcount < 1000; ++testcount) {
    // solution posted by asker
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandOrig(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandOrig:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);

    // Dmitry's small lookup table solution
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandLookupSmall(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandSmallLUT:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // Dmitry's small lookup table solution using only one lookup table
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandLookupSmallOneLUT(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandLookupSmallOneLUT:\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // large lookup table solution
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandLookupLarge(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandLookupLarge:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // AShelly's Interleave bits by Binary Magic Numbers solution
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandAShelly(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandAShelly:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // AShelly's Interleave bits by Binary Magic Numbers solution optimizing out an addition
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandAShellyMulOp(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandAShellyMulOp:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // my SSE4 solution
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandSSE4(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandSSE4:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // my SSE4 solution unrolled
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandSSE4Unroll(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandSSE4Unroll:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // my SSE2 solution
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandSSE2(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandSSE2:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // my SSE2 solution unrolled
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandSSE2Unroll(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandSSE2Unroll:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // AShelly's Interleave bits by Binary Magic Numbers solution implemented using SSE4
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandAShellySSE4(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandAShellySSE4:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;
  }

  delete[] indat;
  delete[] outdat0;
  delete[] outdat1;
  return 0;
}

NOTE:
I had an SSE4 implementation here initially. I found a way to implement this using SSE2, which is better because it will run on more platforms. The SSE2 implementation is also faster. So the solution presented at the top is now the SSE2 implementation and not the SSE4 one. The SSE4 implementation can still be seen in the performance tests or in the edit history.
