Big Data Special Topic

Data Scale Estimation

Average length of a search keyword string: 50 bytes.

  1. 100 million such strings require approximately 5 GB of memory.
  2. 100 million 4-byte integers require approximately 400 MB.
  3. 100 million 1-byte values require approximately 100 MB, so 1 billion would take about 1 GB.

TopK Problem

Finding the K largest elements among n pieces of data? Time complexity: O(n log K). Maintain a min-heap of size K. Once the heap is full, compare each new element with the heap top (the smallest of the current K largest). If the new element is larger than the heap top, remove the heap top and insert the new element; if it is smaller, do nothing.
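A minimal sketch of this heap-based selection in Python, using the standard `heapq` module:

```python
import heapq

def top_k(nums, k):
    """Return the k largest elements, using a min-heap of size k."""
    heap = []
    for x in nums:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:               # larger than the smallest of the current top-k
            heapq.heapreplace(heap, x)  # pop the heap top, push the new element
    return sorted(heap, reverse=True)
```

Keeping the heap at size K means each of the n elements costs at most O(log K) heap work, which gives the O(n log K) total.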

Find the top 10 most popular search keywords from 1 billion log entries? There may be many duplicates, so use a hash table for counting. If the data volume is too large, then to avoid the high memory consumption of an ever-growing hash table, first split the data into 10 small files by hashing each keyword modulo 10, so all occurrences of the same keyword land in the same file. For each file, use a hash table to count and a heap to keep that file's Top 10; then run one final heap pass over these (at most) 100 candidate keywords to get the overall Top 10.
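The shard-count-merge pipeline can be sketched in memory (on disk the shards would be 10 small files; `num_shards` and the function names here are illustrative):

```python
import heapq
from collections import Counter

def top10_keywords(keywords, num_shards=10):
    # Step 1: hash-shard so all copies of the same keyword land in one shard
    # (on disk this would be 10 small files instead of 10 lists).
    shards = [[] for _ in range(num_shards)]
    for kw in keywords:
        shards[hash(kw) % num_shards].append(kw)

    # Step 2: count within each shard and keep its local top 10.
    candidates = []
    for shard in shards:
        counts = Counter(shard)
        candidates.extend(counts.most_common(10))

    # Step 3: one final heap pass over the <= 100 (keyword, count) candidates.
    return heapq.nlargest(10, candidates, key=lambda kv: kv[1])
```

Because sharding is by keyword hash, no keyword's count is split across shards, so the per-shard Top 10 lists are guaranteed to contain the global Top 10.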

Sorting

  1. There are 100 sorted small files, each 100MB in size. How do we merge them into one sorted large file? Maintain a min-heap of 100 elements. First read one line from each of the 100 files into the heap; then repeatedly pop the smallest line, write it to the large file, and read the next line from the file the popped line came from, pushing it into the heap, until all file data has been written to the large file.
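A sketch of this k-way merge, assuming each input file holds one sorted string per line:

```python
import heapq

def merge_sorted_files(paths, out_path):
    """Merge many sorted text files into one sorted file using a min-heap."""
    files = [open(p) for p in paths]
    heap = []
    # Seed the heap with the first line of each file.
    for i, f in enumerate(files):
        line = f.readline()
        if line:
            heapq.heappush(heap, (line, i))
    with open(out_path, "w") as out:
        while heap:
            line, i = heapq.heappop(heap)  # smallest line across all files
            out.write(line)
            nxt = files[i].readline()      # refill from the same file
            if nxt:
                heapq.heappush(heap, (nxt, i))
    for f in files:
        f.close()
```

Only one line per file is held in memory at a time, so memory stays bounded by the number of files rather than their total size.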

Real-time Ranking

A skip list can be used: it supports finding the number of nodes smaller than a given value (i.e., a value's rank) in O(log n) time.

We can use an array of size 1,000,000 to represent the mapping from scores to rankings, where rank[s] is the ranking corresponding to score s. During initialization, the rank array can be computed from the user_score table in O(n) time. User ranking queries and updates are both based on this array. Querying the ranking for score s simply returns rank[s], with O(1) complexity; when a user's score changes from s to s+n, we only need to increment the n elements rank[s] through rank[s+n-1], with O(n) complexity.

Real-time TopN

Maintain a min-heap of the current Top N, plus a hash table recording which user IDs are in the heap. When a user's score changes: if the user is already in the heap, modify their entry in the heap directly (and restore heap order); if not, compare the new score with the heap-top user's score, and if the new score is larger, pop the heap top, insert the new user, and update the hash table accordingly. Sorting the Top N for display can be handled by the client side.
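One way to sketch this structure in Python (the `TopN` class and its method names are illustrative). Note that modifying an arbitrary heap entry is simplified here by rebuilding the heap; a production version would track each user's heap index instead:

```python
import heapq

class TopN:
    """Maintain the top-n user scores with a min-heap plus a user-id hash table."""

    def __init__(self, n):
        self.n = n
        self.heap = []     # (score, user_id), smallest top-n score on top
        self.in_heap = {}  # user_id -> current score, for O(1) membership checks

    def update(self, user_id, score):
        if user_id in self.in_heap:
            # User already in the top n: replace their entry, restore heap order.
            self.heap = [(s, u) for (s, u) in self.heap if u != user_id]
            heapq.heapify(self.heap)
            heapq.heappush(self.heap, (score, user_id))
            self.in_heap[user_id] = score
        elif len(self.heap) < self.n:
            heapq.heappush(self.heap, (score, user_id))
            self.in_heap[user_id] = score
        elif score > self.heap[0][0]:
            # New score beats the current heap top: evict it.
            _, evicted = heapq.heapreplace(self.heap, (score, user_id))
            del self.in_heap[evicted]
            self.in_heap[user_id] = score

    def ranking(self):
        return sorted(self.heap, reverse=True)  # client-side sort for display
```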

Big Data Sensitive Word Filtering

AC automaton (a trie augmented with failure links), which matches a large dictionary of sensitive words against the text in a single pass.
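A compact Aho-Corasick sketch: a trie is built from the sensitive words, then a BFS adds failure links so the text is scanned in one pass regardless of dictionary size:

```python
from collections import deque

class ACAutomaton:
    """Trie with Aho-Corasick failure links for multi-pattern matching."""

    def __init__(self, words):
        self.children = [{}]    # node -> {char: child node}
        self.fail = [0]         # failure link per node
        self.output = [set()]   # patterns ending at (or via fail links of) node
        for w in words:
            self._insert(w)
        self._build_fail_links()

    def _insert(self, word):
        node = 0
        for ch in word:
            if ch not in self.children[node]:
                self.children.append({})
                self.fail.append(0)
                self.output.append(set())
                self.children[node][ch] = len(self.children) - 1
            node = self.children[node][ch]
        self.output[node].add(word)

    def _build_fail_links(self):
        q = deque(self.children[0].values())  # depth-1 nodes keep fail = root
        while q:
            node = q.popleft()
            for ch, child in self.children[node].items():
                # Follow the parent's fail chain to the longest proper suffix.
                f = self.fail[node]
                while f and ch not in self.children[f]:
                    f = self.fail[f]
                self.fail[child] = self.children[f].get(ch, 0)
                self.output[child] |= self.output[self.fail[child]]
                q.append(child)

    def find(self, text):
        """Return the set of dictionary words appearing in text."""
        found, node = set(), 0
        for ch in text:
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            found |= self.output[node]
        return found
```

For filtering (rather than just detection), the same scan can record match positions and mask them out.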

Big Data Deduplication?

1 billion URLs, each URL at most 56B (about 56GB in total), to be deduplicated with 4GB of memory available? Answer: first sort the data distributed across multiple machines (at least 14, since 56GB / 4GB = 14), then merge the sorted batches across machines, and finally scan the sorted output once, dropping adjacent duplicates, with O(1) space and O(N) time.
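The final dedup pass over sorted input can be sketched as a one-pass generator, comparing each line only with its predecessor:

```python
def dedup_sorted(lines):
    """Deduplicate an already-sorted stream in O(1) extra space, one pass."""
    prev = None
    for line in lines:
        if line != prev:  # duplicates are adjacent after sorting
            yield line
            prev = line
```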

For numeric deduplication, bitmaps can be used.
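For example, bounded non-negative integers can be deduplicated with one bit per possible value (the `Bitmap` class here is illustrative):

```python
class Bitmap:
    """One bit per possible value: 100 million integers fit in ~12.5 MB."""

    def __init__(self, max_value):
        self.bits = bytearray(max_value // 8 + 1)

    def test_and_set(self, n):
        """Return True if n was already seen, then mark it as seen."""
        byte, mask = n >> 3, 1 << (n & 7)
        seen = bool(self.bits[byte] & mask)
        self.bits[byte] |= mask
        return seen

def dedup(nums, max_value):
    bm = Bitmap(max_value)
    return [n for n in nums if not bm.test_and_set(n)]
```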

Find Common Lines Between Two Large Data Files?

  1. Bloom filter. For a false-positive rate p below 0.01, the bitmap needs about m = -n ln p / (ln 2)² ≈ 9.6n bits for n input lines (with the optimal number of hash functions); sizing it at around 13n bits gives extra margin.
  2. Hash sharding. Hash each line and take it modulo the shard count, so identical lines from the two files always land in the same file pair, e.g. the 1st small file of File A and the 1st small file of File B. Then, for each pair, read A's small file to build a hash table and scan B's small file against it.
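An in-memory sketch of the sharded comparison (on disk each bucket would be a small file; the names are illustrative):

```python
def common_lines(lines_a, lines_b, num_shards=100):
    """Hash-shard both inputs; identical lines always land in paired shards."""
    def shard(lines):
        buckets = [[] for _ in range(num_shards)]
        for line in lines:
            buckets[hash(line) % num_shards].append(line)
        return buckets

    shards_a, shards_b = shard(lines_a), shard(lines_b)
    common = []
    for a, b in zip(shards_a, shards_b):
        seen = set(a)  # hash table built from A's i-th shard
        common.extend(x for x in b if x in seen)
    return common
```

Only one shard pair needs to be in memory at a time, so each pair can be far smaller than the original files.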