map and flatMap are similar, in the sense they take a line from the input RDD and apply a function on it. The way they differ is that the function in map returns only one element, while function in flatMap can return a list of elements (0 or more) as an iterator.
Also, the output of the flatMap is flattened. Although the function in flatMap returns a list of elements, the flatMap returns an RDD which has all the elements from the list in a flat way (not a list).
Sounds a bit confusing. In the below code snippet, on the input lines both map and flatMap are applied and output dumped in HDFS to wordsWithMap and wordsWithFlatMap folder.
from pyspark import SparkContext
sc = SparkContext("spark://bigdata-vm:7077", "Map")
lines = sc.parallelize(["hello world", "hi"])
wordsWithMap = lines.map(lambda line: line.split(" ")).coalesce(1)
wordsWithFlatMap = lines.flatMap(lambda line: line.split(" ")).coalesce(1)
The input function to map returns a single element, while the flatMap returns a list of elements (0 or more). And also, the output of the flatMap is flattened.
In the case of word count, where the input line is split into multiple words, flatMap can be used. Also, in the case of weather data set, the extractData nethod will validate the record and might or might not return a value. In this case also, flatMap can be used.