Metnin bir terabayt ayrıştırma ve verimli bir şekilde her sözcüğün geçtiği saymayı

SORU

30 AĞUSTOS 2012, PERŞEMBE

Metnin bir terabayt ayrıştırma ve verimli bir şekilde her sözcüğün geçtiği saymayı

Geçenlerde bir röportajda bir soru karşısında şunları yapmalıdır olan herhangi bir dilde bir algoritma oluşturmak için geldim

İçerik 1 terabayt oku
İçerik her reoccuring sözcük sayımı yapmak
En sık görülen ilk 10 kelime listesi

Bana bunun için bir algoritma oluşturmak için mümkün olan en iyi şekilde haber verir misiniz?

Düzenleme:

TAMAM, hadi içeriğin İngilizce olduğunu söylüyorlar. En sık içerik gerçekleşen ilk 10 kelimeleri nasıl bulabiliriz? Benim diğer şüphe yok ki, eğer bilerek sonra benzersiz verileri veriyoruz eğer bizim yığın taşması arabellek boyutu ile sona erecek. Bunu da halletmemiz gerekiyor.

CEVAP

12 EYLÜL 2012, ÇARŞAMBA

Cevap Röportaj

Bu görev çok karmaşık olmadan ilginç, iyi bir teknik tartışma başlatmak için harika bir yoldur. Bu görevin üstesinden gelmek için planım:

Bir deyişle veri girişi bölünmüş, beyaz boşluk kullanımı ve sınırlayıcı olarak noktalama
Her kelime düğüm bir kelime son harfi temsil eden sayaç güncellendi Trie bir yapı içine yem
Traverse tamamen dolu düğümleri bulmak için ağacın en yüksek sayar

Röportaj bağlamında ... tahta ya da kağıt üzerinde ağaç çizimi Trie fikri göstermek istiyorum. Baştan boş, o ağacın tek bir cümle en az bir yinelenen sözcük içeren dayanarak inşa etmek. Demek< . em ^"kedi fareyi yakala". Ağacın en yüksek sayıları bulmak için geçiş nasıl sonunda göster. Ben o zaman nasıl haklı bu ağacın sağlar iyi bellek kullanımı, iyi kelime arama hızı (özellikle davanın doğal dil için pek çok kelime türetmek birbirimizi), ve için uygundur paralel işleme.

Tahtaya çizmek

Draw the example trie

Demo

C# programı altında 4 çekirdekli bir xeon W3520 75secs metin, 8 iş parçacığı en fazla 2 GB geçer. Performans civarındaSaniyede 4.3 milyon kelimedaha az, en uygun giriş kodu ayrıştırma. Mağaza kelimeler Trie structure ile bellek doğal dil giriş işlerken bir sorun değildir.

Notlar:

test metin Gutenberg project elde
giriş ayrıştırma kod Satır sonları varsayar ve sub-optimal olması güzel
ve sözcük olmayan diğer noktalama kaldırılması çok iyi yapılır
işlemebüyük bir dosyabirkaç küçük bir yerine kod küçük bir miktar okumaya başlamak için gerektirecektir belirtilen konuları arasında dosya içinde mahsup.

using System;
using System.Collections.Generic;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

namespace WordCount
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            Console.WriteLine("Counting words...");
            DateTime start_at = DateTime.Now;
            TrieNode root = new TrieNode(null, '?');
            Dictionary<DataReader, Thread> readers = new Dictionary<DataReader, Thread>();

            if (args.Length == 0)
            {
                args = new string[] { "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt",
                                      "war-and-peace.txt", "ulysees.txt", "les-miserables.txt", "the-republic.txt" };
            }

            if (args.Length > 0)
            {
                foreach (string path in args)
                {
                    DataReader new_reader = new DataReader(path, ref root);
                    Thread new_thread = new Thread(new_reader.ThreadRun);
                    readers.Add(new_reader, new_thread);
                    new_thread.Start();
                }
            }

            foreach (Thread t in readers.Values) t.Join();

            DateTime stop_at = DateTime.Now;
            Console.WriteLine("Input data processed in {0} secs", new TimeSpan(stop_at.Ticks - start_at.Ticks).TotalSeconds);
            Console.WriteLine();
            Console.WriteLine("Most commonly found words:");

            List<TrieNode> top10_nodes = new List<TrieNode> { root, root, root, root, root, root, root, root, root, root };
            int distinct_word_count = 0;
            int total_word_count = 0;
            root.GetTopCounts(ref top10_nodes, ref distinct_word_count, ref total_word_count);
            top10_nodes.Reverse();
            foreach (TrieNode node in top10_nodes)
            {
                Console.WriteLine("{0} - {1} times", node.ToString(), node.m_word_count);
            }

            Console.WriteLine();
            Console.WriteLine("{0} words counted", total_word_count);
            Console.WriteLine("{0} distinct words found", distinct_word_count);
            Console.WriteLine();
            Console.WriteLine("done.");
        }
    }

    #region Input data reader

    public class DataReader
    {
        static int LOOP_COUNT = 1;
        private TrieNode m_root;
        private string m_path;        

        public DataReader(string path, ref TrieNode root)
        {
            m_root = root;
            m_path = path;
        }

        public void ThreadRun()
        {
            for (int i = 0; i < LOOP_COUNT; i  ) // fake large data set buy parsing smaller file multiple times
            {
                using (FileStream fstream = new FileStream(m_path, FileMode.Open, FileAccess.Read))
                {
                    using (StreamReader sreader = new StreamReader(fstream))
                    {
                        string line;
                        while ((line = sreader.ReadLine()) != null)
                        {
                            string[] chunks = line.Split(null);
                            foreach (string chunk in chunks)
                            {
                                m_root.AddWord(chunk.Trim());
                            }
                        }
                    }
                }
            }
        }
    }

    #endregion

    #region TRIE implementation

    public class TrieNode : IComparable<TrieNode>
    {
        private char m_char;
        public int m_word_count;
        private TrieNode m_parent = null;
        private ConcurrentDictionary<char, TrieNode> m_children = null;

        public TrieNode(TrieNode parent, char c)
        {
            m_char = c;
            m_word_count = 0;
            m_parent = parent;
            m_children = new ConcurrentDictionary<char, TrieNode>();            
        }

        public void AddWord(string word, int index = 0)
        {
            if (index < word.Length)
            {
                char key = word[index];
                if (char.IsLetter(key)) // should do that during parsing but we're just playing here! right?
                {
                    if (!m_children.ContainsKey(key))
                    {
                        m_children.TryAdd(key, new TrieNode(this, key));
                    }
                    m_children[key].AddWord(word, index   1);
                }
                else
                {
                    // not a letter! retry with next char
                    AddWord(word, index   1);
                }
            }
            else
            {
                if (m_parent != null) // empty words should never be counted
                {
                    lock (this)
                    {
                        m_word_count  ;                        
                    }
                }
            }
        }

        public int GetCount(string word, int index = 0)
        {
            if (index < word.Length)
            {
                char key = word[index];
                if (!m_children.ContainsKey(key))
                {
                    return -1;
                }
                return m_children[key].GetCount(word, index   1);
            }
            else
            {
                return m_word_count;
            }
        }

        public void GetTopCounts(ref List<TrieNode> most_counted, ref int distinct_word_count, ref int total_word_count)
        {
            if (m_word_count > 0)
            {
                distinct_word_count  ;
                total_word_count  = m_word_count;
            }
            if (m_word_count > most_counted[0].m_word_count)
            {
                most_counted[0] = this;
                most_counted.Sort();
            }
            foreach (char key in m_children.Keys)
            {
                m_children[key].GetTopCounts(ref most_counted, ref distinct_word_count, ref total_word_count);
            }
        }

        public override string ToString()
        {
            if (m_parent == null) return "";
            else return m_parent.ToString()   m_char;
        }

        public int CompareTo(TrieNode other)
        {
            return this.m_word_count.CompareTo(other.m_word_count);
        }
    }

    #endregion
}

Metin aynı 20 MB işleme çıktısı 8 iş parçacığı üzerinde 100 kez burada.

Counting words...
Input data processed in 75.2879952 secs

Most commonly found words:
the - 19364400 times
of - 10629600 times
and - 10057400 times
to - 8121200 times
a - 6673600 times
in - 5539000 times
he - 4113600 times
that - 3998000 times
was - 3715400 times
his - 3623200 times

323618000 words counted
60896 distinct words found

Bunu Paylaş: