Challenge: Optimize the function MatchCharInString 乐筑天下

pkohut posted on 2010-2-26 07:48:01


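Edit: added the source code project files.
The function MatchCharsInString listed below (project files with a test harness are also attached) works, but it is not as efficient as I would like. Your task, should you choose to accept it, is to make it faster. Note that the strings are UTF-16 multibyte. That means a "for and while loop" may be the wrong optimization, because the function has to work with different Unicode locales. One obvious optimization is to avoid copying the strings in pcszSource and pcszMatchChars. A few ground rules: the MatchCharsInString parameters cannot be changed, and CString and/or AString are not viable options. TIA, and have fun.
Sample output from the included project files.
Here, the first test compares the characters 'B' 'B' 'O' against the string "This is a Test" and finds no match, because no 'B' is found.
The second line tests 'B' 'B' 'O' against "Bob said, "Hello World!"" and finds matches at positions 0 ('B'), 2 ('B') and 15 ('O').
Each character of 'B' 'B' 'O' is matched only once in the string; when one is found, the search for the next character continues from the next string position.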
Testing -- BBO
Does not match: This is a Test
      Matches: Bob said, "Hello World!" 0 2 15
Does not match: Lizzard Lips
Does not match: Zebra tails             2
Does not match: Luke warm milk
Testing -- RZ
Does not match: This is a Test
Does not match: Bob said, "Hello World!" 19
Does not match: Lizzard Lips             5
Does not match: Zebra tails             3
Does not match: Luke warm milk          7
Testing -- ZR
Does not match: This is a Test
Does not match: Bob said, "Hello World!"
      Matches: Lizzard Lips             2 5
      Matches: Zebra tails             0 3
Does not match: Luke warm milk
Testing -- KML
Does not match: This is a Test
Does not match: Bob said, "Hello World!"
Does not match: Lizzard Lips
Does not match: Zebra tails
      Matches: Luke warm milk          2 8 12

MatchCharInString
// Does a lowercase match of each character in sMatchChars to a character in the string sSource
// (also lower case).
// Each match is done in sequence. When the first character from sMatchChars is found, the
// next character from sMatchChars is searched for from the current position of sSource to
// the end of the sSource string.
// The position of each match is stored in the vector index that is passed by reference.
// Returns true if all sMatchChars are found before reaching the end of sSource, false otherwise.
bool MatchCharsInString(const wchar_t * pcszSource, const wchar_t * pcszMatchChars, std::vector<size_t> & index)
{
    std::wstring sSource(pcszSource);
    std::wstring sMatchChars(pcszMatchChars);
    // change input strings to lower case
    // (std::tolower here is the <cctype> version, which is not locale-aware)
    std::transform(sSource.begin(), sSource.end(), sSource.begin(), std::tolower);
    std::transform(sMatchChars.begin(), sMatchChars.end(), sMatchChars.begin(), std::tolower);
    bool bFound = true;
    size_t pos = 0;
    std::wstring::iterator itChar = sMatchChars.begin();
    for(; itChar != sMatchChars.end(); ++itChar) {
        // search from the current position to the end of sSource
        pos = sSource.find_first_of(*itChar, pos);
        if(pos == std::wstring::npos) {
            // if the position of sSource is the end of the string then
            // there was not a complete match of each character in sMatchChars
            bFound = false;
            break;
        }
        index.push_back(pos); // push valid position to index
        pos++;                // continue after the matched character
    }
    return bFound;
}
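
The "obvious optimization" mentioned in the post, not copying pcszSource and pcszMatchChars, could look roughly like the sketch below. It is only an illustration, not the version from the attached project: the name MatchCharsInStringNoCopy is made up, it assumes the index vector holds size_t, and it lowercases one wchar_t unit at a time with std::towlower, so it does not treat surrogate pairs specially.

```cpp
#include <cstddef>
#include <cwctype>
#include <vector>

// Hypothetical name; same parameter list as MatchCharsInString.
// Walks both strings in place instead of copying them into std::wstring.
bool MatchCharsInStringNoCopy(const wchar_t* pcszSource,
                              const wchar_t* pcszMatchChars,
                              std::vector<std::size_t>& index)
{
    std::size_t pos = 0;
    for (; *pcszMatchChars; ++pcszMatchChars) {
        const std::wint_t c = std::towlower(*pcszMatchChars);
        // Scan forward for the next source unit that equals c when lowercased.
        while (pcszSource[pos] && std::towlower(pcszSource[pos]) != c)
            ++pos;
        if (!pcszSource[pos])
            return false;        // source exhausted before every character matched
        index.push_back(pos++);  // record the hit, then resume after it
    }
    return true;
}
```

As in the original, positions already pushed to index are left in place when the match fails part way through.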


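pkohut posted on 2010-2-26 07:53:53

Careful.
This version is very fast, and wrong. Why? Because it does not take the Unicode locale settings into account, and it will bomb on any multibyte Unicode value.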
////////////////////////////////////////////////////////////////////////
// Will not work for some unicode locales.
////////////////////////////////////////////////////////////////////////
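bool MatchString(const wchar_t * pcszSource, const wchar_t * pcszMatchChars)
{
    while(*pcszMatchChars) {
        wchar_t c = std::tolower(*pcszMatchChars);
        // scan forward for the next source character equal to c
        while(*pcszSource) {
            if(c == std::tolower(*pcszSource)) {
                break;
            }
            pcszSource++;
        }
        if(!*pcszSource)
            break;          // ran out of source text
        pcszSource++;       // step past the match so it is not matched twice
        pcszMatchChars++;
    }
    // matched only if every character in pcszMatchChars was consumed
    return !*pcszMatchChars;
}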

pkohut posted on 2010-2-26 22:13:39

I don't know...
#include "stdafx.h"
#include <windows.h>   // IS_HIGH_SURROGATE, UINT32
#include <locale>
#include <vector>
void MatchString(const wchar_t *pcszSource, const wchar_t *pcszMatchChars,
                 std::vector<size_t> & index)
{
    std::locale loc;
    bool bMatchIsHigh;
    bool bSrcIsHigh;
    // idx counts characters (a surrogate pair counts as one), not wchar_t units
    for(size_t idx = 0; *pcszSource != L'\0'; idx++)
    {
        if(*pcszMatchChars != L'\0')
        {
            bMatchIsHigh = IS_HIGH_SURROGATE(*pcszMatchChars);
            bSrcIsHigh = IS_HIGH_SURROGATE(*pcszSource);
            if(!bMatchIsHigh && !bSrcIsHigh)
            {
                // both are single-unit characters
                if(std::tolower(*pcszSource,loc) == std::tolower(*pcszMatchChars,loc))
                {
                    index.push_back(idx);
                    pcszMatchChars++;
                }
                pcszSource++;
            }
            else if(bMatchIsHigh && bSrcIsHigh)
            {
                // both are surrogate pairs; each pair must be complete
                if(*(pcszSource+1) != L'\0' && *(pcszMatchChars+1) != L'\0')
                {
                    UINT32 a = ((*pcszSource - 0xD800) * 0x400) +
                        (*(pcszSource+1) - 0xDC00) + 0x10000;
                    UINT32 b = ((*pcszMatchChars - 0xD800) * 0x400) +
                        (*(pcszMatchChars+1) - 0xDC00) + 0x10000;
                    // NOTE: the standard library only guarantees ctype facets for char
                    // and wchar_t, so tolower on a UINT32 may not work as hoped here.
                    if(std::tolower(a,loc) == std::tolower(b,loc))
                    {
                        index.push_back(idx);
                        pcszMatchChars++;
                        pcszMatchChars++;
                    }
                    pcszSource++;
                    pcszSource++;
                }
                else
                {
                    return; // bad format
                }
            }
            else if(bMatchIsHigh && !bSrcIsHigh)
            {
                pcszSource++;   // skip a single-unit source character
            }
            else if(!bMatchIsHigh && bSrcIsHigh)
            {
                pcszSource++;   // skip both units of the source surrogate pair
                pcszSource++;
            }
        }
        else
        {
            break;  // every match character has been found
        }
    }
}
int _tmain(int argc, _TCHAR* argv[])
{
    std::vector<size_t> index;
    wchar_t a[] = {'d', 'a', 0xD834, 0xDD1E,'\0'};
    wchar_t b[] = {'a', 0xD834, 0xDD1E,'\0'};
    //wchar_t a[] = {'L','u','k','e',' ',0xD834,0xDD1E,'w','a','r','m',' ','m','i','l','k','\0'};
    //wchar_t b[] = {'K',0xD834,0xDD1E,'L','\0'}; // 2 5 13
    //wchar_t a[] = {'L','u','k','e',' ','w','a','r','m',' ','m','i','l','k','\0'};
    //wchar_t b[] = {'K','M','L','\0'};
    MatchString(a,b,index);
    for(size_t i = 0; i < index.size(); i++)
    {
        wprintf(L"%lu ", static_cast<unsigned long>(index[i]));
    }
    system("pause");
    return 0;
}

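pkohut posted on 2010-2-27 05:47:53

That's a nice solution, Dan. I hadn't seen the surrogate macros; they might work.
http://msdn.microsoft.com/en-us/library/dd374069(VS.85).aspx
I think
      UINT32 a = ((*pcszSource - 0xD800) * 0x400) +
         (*(pcszSource+1) - 0xDC00) + 0x10000;
      UINT32 b = ((*pcszMatchChars - 0xD800) * 0x400) +
         (*(pcszMatchChars+1) - 0xDC00) + 0x10000;
      if(std::tolower(a,loc) == std::tolower(b,loc))
can be simplified to
      if(std::tolower(*pcszSource,loc) == std::tolower(*pcszMatchChars,loc))
since it is locale-aware (untested). If that is the case, the original problem becomes a simple matter of bookkeeping on the pointers before passing them to tolower.
The only other thing to do is initialize the std::locale early in the application and have a function that returns a reference to it; then this thing will be really fast.
In my original version I used std::tolower from <cctype>, because I had not understood what "locale independent" meant; that was a good catch. Thanks again.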
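
The idea of creating the std::locale once and handing out a reference could be sketched as below. The names AppLocale and MatchCharsWithFacet are hypothetical, and the sketch assumes the global locale is already configured before the first call. It also fetches the std::ctype<wchar_t> facet once per call rather than once per character. Like the posts above, it compares single wchar_t units and does not handle surrogate pairs.

```cpp
#include <cstddef>
#include <locale>
#include <vector>

// Hypothetical helper: one process-wide locale, constructed on first use.
const std::locale& AppLocale()
{
    static const std::locale loc;   // copy of the global locale at first call
    return loc;
}

// Hypothetical name; same parameter list as MatchCharsInString.
bool MatchCharsWithFacet(const wchar_t* pcszSource,
                         const wchar_t* pcszMatchChars,
                         std::vector<std::size_t>& index)
{
    // Look the facet up once instead of once per character.
    const std::ctype<wchar_t>& ct =
        std::use_facet<std::ctype<wchar_t> >(AppLocale());
    std::size_t pos = 0;
    for (; *pcszMatchChars; ++pcszMatchChars) {
        const wchar_t c = ct.tolower(*pcszMatchChars);
        while (pcszSource[pos] && ct.tolower(pcszSource[pos]) != c)
            ++pos;
        if (!pcszSource[pos])
            return false;        // not every match character was found
        index.push_back(pos++);  // record the hit, then resume after it
    }
    return true;
}
```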

pkohut posted on 2010-2-27 06:43:56

BTW, could you clarify the intent of
      UINT32 a = ((*pcszSource - 0xD800) * 0x400) +
         (*(pcszSource+1) - 0xDC00) + 0x10000;

pkohut posted on 2010-2-27 06:49:03

I think it is a conversion to UTF-32, though I may have gotten it wrong.
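
For reference, the arithmetic in question is the usual way to combine a UTF-16 surrogate pair into one UTF-32 code point. A minimal sketch follows; the helper names are invented for illustration and are not the Windows macros. One detail to watch: *p+1 parses as (*p)+1, so the low unit has to be read as *(p+1) or p[1].

```cpp
#include <cstdint>

// Hypothetical helpers. In UTF-16, high (lead) surrogates are 0xD800..0xDBFF
// and low (trail) surrogates are 0xDC00..0xDFFF.
inline bool IsHighSurrogate(std::uint32_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool IsLowSurrogate(std::uint32_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combines a valid surrogate pair into a single UTF-32 code point.
// Each surrogate carries 10 bits; 0x10000 is the first supplementary code point.
inline std::uint32_t DecodeSurrogatePair(std::uint32_t hi, std::uint32_t lo)
{
    return ((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000;
}
```

With the pair used in the test data above, 0xD834 and 0xDD1E, this gives 0x1D11E.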

pkohut posted on 2010-2-27 06:50:09

I don't know whether characters above 0xD800 have lowercase forms. Originally I just had it like this:
if((*pcszSource + *(pcszSource+1)) ==
(*pcszMatchChars + *(pcszMatchChars+1)))

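pkohut posted on 2010-2-27 06:58:18

Oh, I see.
I think that, with the surrogate macros, you just move the pointer to the right position and then let std::tolower(ptr, loc) handle the rest. I'm not sure, because my head has been buried in the STL and Boost libraries trying to figure it out, and all the generics get in the way of understanding.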

pkohut posted on 2010-2-27 07:13:58

You may be right.

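pkohut posted on 2010-2-27 21:42:53

This is what the memory layout of the different strings looks like on OS X. Note that szStr and sStr are handled as UTF-8, while wchar_t and wstring are UTF-32 (on Windows they are UTF-16).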
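
The practical consequence of that platform difference can be shown with a small sketch. WideUnitsFor is a hypothetical helper, and it assumes wchar_t is either 2 bytes (UTF-16, as on Windows) or 4 bytes (UTF-32, as on OS X).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many wchar_t units one code point occupies.
// Assumes wchar_t is either 16-bit (UTF-16) or 32-bit (UTF-32).
inline std::size_t WideUnitsFor(std::uint32_t codePoint)
{
    // With 16-bit wchar_t, code points above U+FFFF need a surrogate pair.
    if (sizeof(wchar_t) == 2 && codePoint > 0xFFFF)
        return 2;
    // With 32-bit wchar_t, every code point fits in a single unit.
    return 1;
}
```

So the surrogate handling discussed above only comes into play where wchar_t is 16 bits wide.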
