NLP (natural language processing) project: spam classification

In [1]:
import findspark
findspark.init()
findspark.find()
Out[1]:
'/Users/heyunan/opt/anaconda3/lib/python3.9/site-packages/pyspark'

Instantiate a Spark session

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SpamClassifier').getOrCreate()

Load text data

In [3]:
# Load the tab-separated file (no header row) and rename the default columns _c0 and _c1
df = spark.read.option("header", "false") \
    .option("delimiter", "\t") \
    .option("inferSchema", "true") \
    .csv("SMSSpamCollection.txt") \
    .withColumnRenamed("_c0", "label_string") \
    .withColumnRenamed("_c1", "sms")

df.limit(10).show()
+------------+--------------------+
|label_string|                 sms|
+------------+--------------------+
|         ham|Go until jurong p...|
|         ham|Ok lar... Joking ...|
|        spam|Free entry in 2 a...|
|         ham|U dun say so earl...|
|         ham|Nah I don't think...|
|        spam|FreeMsg Hey there...|
|         ham|Even my brother i...|
|         ham|As per your reque...|
|        spam|WINNER!! As a val...|
|        spam|Had your mobile 1...|
+------------+--------------------+

Task: use this dataset to train a naive Bayes classifier that decides whether a message is spam or a normal message (ham).

In [ ]: