The `WordCounts` class

we want to compute how often each word appears in a text
to do so, you are asked to write a class that is used as shown below

data

the text file contains the first chapter of The Hitch-Hiker’s Guide to the Galaxy, a.k.a. hhgg
you can download it here
you can also use any other document in plain text format

from wordcounts import WordCounts

wc = WordCounts("wordcounts-data.txt")

# we arbitrarily choose to display the 5 most frequent words
print(wc)

wordcounts-data.txt: 1580 total words, 570 different words
    the : 65
     he : 56
      a : 52
     to : 52
     it : 40

# then we can look up the number of occurrences like this

for word in ['arthur', 'people']:
    print(f"word {word} was found {wc.counter[word]} times")

word arthur was found 16 times
word people was found 9 times

# and check whether a word appears or not

for word in ['arthur', 'armageddon']:
    present = word in wc.vocabulary()
    print(f"is word '{word}' present ? : {present} ")

is word 'arthur' present ? : True 
is word 'armageddon' present ? : False 

Hints

it is reasonable to convert everything to lowercase once and for all, at the very beginning of the processing
you may want to look at the standard string module, and at string.punctuation
note also that the text in question contains non-ASCII quotation marks “ ”, which are not part of string.punctuation
see also the collections.Counter class, which will make your life much easier
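
as an illustration of these hints, here is a minimal sketch on a short hard-coded sentence; the sentence and the variable names are made up for this example, and this is not the full solution

```python
from collections import Counter
from string import punctuation

# made-up sample sentence, with curly quotes like the ones found in the text
text = "“Yellow,” he thought. The yellow bulldozer; the YELLOW house!"

# string.punctuation only has ASCII characters, so we add the curly quotes
to_erase = punctuation + "“”"

# lowercase once, then turn every punctuation character into a space
cleaned = text.lower()
for char in to_erase:
    cleaned = cleaned.replace(char, " ")

# a Counter maps each word to its number of occurrences
counts = Counter(cleaned.split())
print(counts.most_common(2))
```

this should print `[('yellow', 3), ('the', 2)]`; also note that a Counter returns 0 for a word it has never seen, instead of raising KeyError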

Variants

how would you find all the words that appear between 30 and 40 times in the text?
if you feel comfortable with it (this requires operator overloading), make it possible to also write:

for word in ['arthur', 'people']:
    # here we can index the WordCounts instance directly
    print(f"word {word} was found {wc[word]} times")

word arthur was found 16 times
word people was found 9 times
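
as a reminder of how this kind of operator overloading works in general, here is a minimal sketch on a toy class; the `Scores` class and its data are hypothetical and unrelated to the exercise, the only point being that `instance[key]` ends up calling `__getitem__`

```python
class Scores:
    """
    hypothetical toy class, only here to illustrate __getitem__
    """

    def __init__(self, scores: dict[str, int]) -> None:
        self._scores = scores

    def __getitem__(self, name: str) -> int:
        # Python calls this method when we write  instance[name]
        # in this sketch we choose to return 0 for unknown names
        return self._scores.get(name, 0)


scores = Scores({"arthur": 16, "people": 9})
print(scores["arthur"])   # 16
print(scores["ford"])     # 0
```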

Solution

the class

"""
playing with word frequencies
"""

from collections import Counter
from string import punctuation

# the text has unicode quotes in it
punctuation +=  "“”"

class WordCounts:

    def __init__(self, filename) -> None:
        # just in case, keep for future reference
        self.filename = filename

        # the words as they appear in the text, all lowercase and with punctuation removed
        words = []

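        # read and erase punctuation
        # the text has non-ASCII quotes, so be explicit about the encoding (assumed to be UTF-8)
        with open(self.filename, encoding="utf-8") as feed: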
            for line in feed:
                line = line.strip().lower()
                for char in punctuation:
                    line = line.replace(char, " ")
                # add the words in the list
                words.extend(line.split())

        # using Counter makes it easier
        self.counter = Counter(words)

    def __repr__(self) -> str:
        result = ""
        result += f"{self.filename}:"
        result += f" {self.size()} total words,"
        # end the header line before listing the most frequent words
        result += f" {len(self.vocabulary())} different words\n"
        result += "\n".join(f"  {w:>5} : {c}" for w, c in self.counter.most_common(5))
        return result

    def size(self) -> int:
        """
        number of words in the original text
        """
        # Counter.total() does the same, but only from Python 3.10 on
        # return self.counter.total()
        # a Counter is a dict, so we can simply sum its values
        return sum(self.counter.values())

    def vocabulary(self) -> set[str]:
        """
        return the set of words used in the text
        """
        # iterating over a Counter yields its keys, i.e. the distinct words
        return set(self.counter)

    # for the variant
    def __getitem__(self, word: str) -> int:
        return self.counter[word]
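
as a side note, the punctuation can also be erased in a single pass with `str.translate`; this is an alternative to the repeated `replace()` calls, not what the class above does, and the `split_words` helper is a name made up for this sketch

```python
from string import punctuation

# same set of characters as in the class: ASCII punctuation plus the curly quotes
to_erase = punctuation + "“”"

# translation table that maps each of these characters to a space
table = str.maketrans(to_erase, " " * len(to_erase))


def split_words(line: str) -> list[str]:
    """
    hypothetical helper: lowercase a line, erase punctuation, split into words
    """
    return line.lower().translate(table).split()


print(split_words("“Hello, World!” said Arthur."))
```

this should print `['hello', 'world', 'said', 'arthur']`; both approaches should give the same words, the table simply avoids scanning the line once per punctuation character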

the queries

# the words that appear between 30 and 40 times

{word for word, count in wc.counter.items() if 30 <= count <= 40}

-> {'and', 'it', 'was'}
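
if this kind of query comes up often, the set comprehension above could be wrapped in a small function; here is a sketch that works on any Counter, where the name `words_in_range` and the sample data are assumptions made for the example

```python
from collections import Counter


def words_in_range(counter: Counter, low: int, high: int) -> set[str]:
    """
    hypothetical helper: the words whose count is between low and high, both included
    """
    return {word for word, count in counter.items() if low <= count <= high}


# made-up sample: 'a' appears 3 times, 'b' twice, 'c' once
sample = Counter("a a a b b c".split())

# a set has no guaranteed order, so we sort it for display
print(sorted(words_in_range(sample, 2, 3)))   # ['a', 'b']
```

with the class above, one would presumably call it as `words_in_range(wc.counter, 30, 40)`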