parsing people and groups#
Licence CC BY-NC-ND, Thierry Parmentelat
grab the zip for starters
This activity is about parsing text files, and building structures using builtin types.
read a file#
works on: list, file, tuple
the input: file contains lines like
first_name last_name email phone
fields are separated by any number (but at least one) of spaces/tabs, as in the file people-simple-03:
Mathilde Martin mathilde.martin@example.com 0987654321
Jean Dupont jean.dupont@example.com 0123456789
Alice de-la-Borderie alice.de-la-borderie@mystartup.io 0478362410
todo: write a function for parsing this format; it should return a list of 4-tuples
def parse(filename) -> list[tuple[str, str, str, str]]: ...
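A minimal sketch of one possible solution, not the reference one. It assumes the file is UTF-8 encoded, skips blank lines, and relies on every non-blank line having exactly 4 fields (a malformed line raises ValueError at the unpacking step).

```python
def parse(filename) -> list[tuple[str, str, str, str]]:
    """Read a people file and return one 4-tuple per non-blank line."""
    persons = []
    # assumption: the input is UTF-8 encoded
    with open(filename, encoding="utf-8") as feed:
        for line in feed:
            # split() without argument cuts on any run of spaces/tabs
            # and ignores the trailing newline
            fields = line.split()
            if not fields:
                continue  # blank line
            # raises ValueError if the line does not have exactly 4 fields
            first_name, last_name, email, phone = fields
            persons.append((first_name, last_name, email, phone))
    return persons
```

Calling split() with no argument is what handles "any number (but at least one) of spaces/tabs"; split(" ") would produce empty strings between consecutive spaces.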
discussion
in the whole TP we will model a person as a 4-tuple
however we could just as well have decided to use instead a dictionary with 4 keys
discuss the pros and cons of each approach
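To feed the discussion, here is a small illustration of the same person in both representations. The dictionary key names are an assumption made for this example; they simply reuse the field names of the input format.

```python
# the same person, modelled both ways
person_as_tuple = ("Jean", "Dupont", "jean.dupont@example.com", "0123456789")
person_as_dict = {
    "first_name": "Jean",
    "last_name": "Dupont",
    "email": "jean.dupont@example.com",
    "phone": "0123456789",
}

# tuple: access by position, the reader must remember that index 2 is the email
email_from_tuple = person_as_tuple[2]
# dict: access by name, more readable but a bit more verbose to build
email_from_dict = person_as_dict["email"]

# a tuple of strings is hashable, so it can be stored in a set
# or used as a dictionary key
students = {person_as_tuple}

# a dict is mutable and therefore not hashable
try:
    hash(person_as_dict)
    dict_is_hashable = True
except TypeError:
    dict_is_hashable = False
```

Hashability matters later in this activity, since the groups part stores persons in sets.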
indexing#
works on: hash-based types, comprehensions
what we need: a fast way to
check whether an email is in the file
quickly retrieve the details that go with a given email
question: what is the right data structure to implement that?
todo:
write a function
def index(list_of_tuples):
that builds and returns that data structure
write a function
def initial(list_of_tuples):
that indexes the data on the initial of the first name (what changes do we need to make to the resulting data structure?)
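One possible answer, sketched under the assumption that a dictionary is the data structure being asked for, and that emails are unique in the file. For initial(), several persons can share the same initial, so each value becomes a list of tuples rather than a single tuple.

```python
from collections import defaultdict

def index(list_of_tuples):
    """Map each email to the full 4-tuple (assumes emails are unique)."""
    # the email is the third field, i.e. position 2
    return {person[2]: person for person in list_of_tuples}

def initial(list_of_tuples):
    """Map each first-name initial to the list of matching 4-tuples."""
    by_initial = defaultdict(list)
    for person in list_of_tuples:
        first_name = person[0]
        by_initial[first_name[0]].append(person)
    # return a plain dict so that a missing initial raises KeyError
    return dict(by_initial)

people = [
    ("Mathilde", "Martin", "mathilde.martin@example.com", "0987654321"),
    ("Jean", "Dupont", "jean.dupont@example.com", "0123456789"),
    ("Marc", "Durand", "marc.durand@example.com", "0611223344"),  # made-up entry
]
by_email = index(people)
by_first_letter = initial(people)
```

With this, `"jean.dupont@example.com" in by_email` and `by_email["jean.dupont@example.com"]` both run in constant time on average, instead of scanning the whole list.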
dataframe (optional)#
works on: dataframes
todo: build a pandas dataframe to hold all the data
tip: see the documentation of pd.DataFrame() and observe that there are multiple interfaces to build a dataframe
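As an illustration of two of these interfaces, here is a short sketch that builds the same dataframe from a list of tuples and from a list of dictionaries. The column names are an assumption; they reuse the field names of the input format.

```python
import pandas as pd

people = [
    ("Mathilde", "Martin", "mathilde.martin@example.com", "0987654321"),
    ("Jean", "Dupont", "jean.dupont@example.com", "0123456789"),
]
columns = ["first_name", "last_name", "email", "phone"]

# interface 1: a list of rows, plus explicit column names
df = pd.DataFrame(people, columns=columns)

# interface 2: a list of dictionaries, column names come from the keys
df_from_dicts = pd.DataFrame([dict(zip(columns, person)) for person in people])
```

Since the phone numbers are kept as strings, their leading zero survives in the dataframe.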
groups#
works on: set
a more elaborate input
the file now contains optional fields
first_name last_name email phone [group1 .. groupn]
where the part between [] is optional, i.e. there can be 0 or more group names mentioned on each student line; e.g. from people-groups-10:
Laurine Bodin Laurine.Bodin@green.org 0400419660 maths french
Tom Côté Tom.Côté@green.org 0373230810 french
Pauline Adrien Pauline.Adrien@traditional.com 0488588126 maths
Franck Dupont Franck.Dupont@green.org 0393821035 french
Julia Castillon Julia.Castillon@mystartup.io 0627047562 maths french
Rémi Archambeau Rémi.Archambeau@mystartup.io 0223515785 maths french
Adam Beaufort Adam.Beaufort@mystartup.io 0196433784
Agathe Alarie Agathe.Alarie@mystartup.io 0632393074
Matthieu Bois Matthieu.Bois@green.org 0675802411 maths french
Emma Lavigne Emma.Lavigne@mystartup.io 0349239656 maths
todo: duplicate and tweak the parse function, so as to write
def group_parse(filename):
so it now returns a 2-tuple with
the list of tuples as before
a dictionary of sets
the keys here will be the group names,
and the corresponding value is a set of tuples corresponding to the students in that group
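A possible sketch of this function, under the same assumptions as before (UTF-8 input, blank lines skipped, at least 4 fields on each non-blank line). It uses extended unpacking to collect the optional group names.

```python
from collections import defaultdict

def group_parse(filename):
    """Return (list of 4-tuples, dict mapping group name -> set of 4-tuples)."""
    persons = []
    groups = defaultdict(set)
    # assumption: the input is UTF-8 encoded
    with open(filename, encoding="utf-8") as feed:
        for line in feed:
            fields = line.split()
            if not fields:
                continue  # blank line
            # the starred name receives the 0 or more trailing group names
            first_name, last_name, email, phone, *group_names = fields
            person = (first_name, last_name, email, phone)
            persons.append(person)
            for group_name in group_names:
                # tuples are hashable, so they can be stored in a set
                groups[group_name].add(person)
    return persons, dict(groups)
```

Note that a group with no student never appears as a key, since group names are only discovered on student lines.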
regexps (optional)#
works on: regexps
what we need: be able to check the format for the input file:
first_name and last_name may contain letters, and - and _
email may contain letters, numbers, dots (.), hyphens (-) and must contain exactly one @
phone numbers may contain 10 digits, or +33 followed by 9 digits
todo: write a function
def check_values(L: list[tuple]) -> None:
that expects as an input the output of parse, and that points out ill-formed input
note on ASCII vs Unicode input:
as a first approximation, use patterns like a-z to check for letters; how does this behave with respect to names with accents and cedillas?
then play with \w to see if you can overcome this problem
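The rules above can be sketched as follows. This is one possible reading, not the reference solution: the helper problems() and the pattern names are hypothetical, introduced only for this example, and the email rule is interpreted as "allowed characters on both sides of a single @".

```python
import re

# ASCII-only first approximation
NAME = re.compile(r"[a-zA-Z_-]+")
# the character classes exclude @, so fullmatch() enforces exactly one @
EMAIL = re.compile(r"[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+")
# either 10 digits, or +33 followed by 9 digits
PHONE = re.compile(r"[0-9]{10}|\+33[0-9]{9}")

# Unicode-aware variant: \w matches letters, digits and _ in any script,
# so [^\W\d] keeps the letters and _ but rejects the digits
NAME_UNICODE = re.compile(r"(?:[^\W\d]|-)+")

CHECKS = [("first_name", NAME), ("last_name", NAME),
          ("email", EMAIL), ("phone", PHONE)]

def problems(L):
    """Yield (row_number, field_name, value) for each ill-formed field."""
    for row, person in enumerate(L, start=1):
        for (label, pattern), value in zip(CHECKS, person):
            # fullmatch() requires the whole string to match the pattern
            if not pattern.fullmatch(value):
                yield row, label, value

def check_values(L: list[tuple]) -> None:
    """Print one line per ill-formed field found in the output of parse."""
    for row, label, value in problems(L):
        print(f"row {row}: ill-formed {label}: {value!r}")
```

With the ASCII patterns, a last name such as Côté is reported as ill-formed, which is the issue raised in the note; swapping NAME for NAME_UNICODE in CHECKS accepts it while still rejecting digits.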