Monday, November 12, 2018

Q10 : Most frequently used words in a text

Question 10: Write a function that, given a string of text (possibly with punctuation and line-breaks), returns an array of the top-3 most occurring words, in descending order of the number of occurrences.

Assumptions:
A word is a string of letters (A to Z) optionally containing one or more apostrophes (') in ASCII. (No need to handle fancy punctuation.)
Matches should be case-insensitive, and the words in the result should be lowercased.
Ties may be broken arbitrarily.
If a text contains fewer than three unique words, then either the top-2 or top-1 words should be returned, or an empty array if a text contains no words.
Examples:
top_3_words("In a village of La Mancha, the name of which I have no desire to call to
mind, there lived not long since one of those gentlemen that keep a lance
in the lance-rack, an old buckler, a lean hack, and a greyhound for
coursing. An olla of rather more beef than mutton, a salad on most
nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra
on Sundays, made away with three-quarters of his income.")

# => ["a", "of", "on"]

top_3_words("e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e")

# => ["e", "ddd", "aa"]

top_3_words("  //wont won't won't")

# => ["won't", "wont"]

Bonus points (not really, but just for fun):
Avoid creating an array whose memory footprint is roughly as big as the input text.
Avoid sorting the entire array of unique words.

5 comments:

  1. from heapq import nlargest
    import operator
    def top_3_words(text):

    words = {}
    word_list = text.lower().split()
    if len(word_list) > 0:
    for i in word_list:
    if i in words:
    words[i]+=1
    else:
    words[i]=1
    return [i for (i,j) in nlargest(3,words.items(),key=operator.itemgetter(1))]
    else:
    return []

    print(top_3_words(input()))

    ReplyDelete
    Replies
    1. Hi,
      Even though same words occurs again with some special chars at the end, it is accounted as a new word instead of increasing the existing word count.

      eg: ddd, ddd, ddd:, DDD: all are same words, but ddd: is accounted as a new word since it has a ":" at the end.

      So, i believe we have to get rid of all the special chars (except permitted like single quotes) initially and the need to do the counter work.

      Please find my solution in below comment.

      Delete
    2. Yes, You are right. first we need to remove all the special chars. thanks and I ll update my solution

      Delete
  2. import re,operator
    from collections import Counter
    from heapq import nlargest
    def top_3_words(text):
    if len(text)>0:
    words = dict(Counter(re.findall(r'\w+', text.lower())))
    return [i for (i,j) in nlargest(3,words.items(),key=operator.itemgetter(1))]
    else:
    return []
    print(top_3_words(input()))

    ReplyDelete
  3. import re
    from collections import Counter
    def top_3_words(words):
    strip_words = re.sub(r'[^ a-zA-Z0-9\']', '', words.lower()).split()
    counted_words = Counter(strip_words)
    return sorted(counted_words, key=counted_words.__getitem__, reverse=True)[:3]

    ReplyDelete