NLP (natural language processing) project: spam classification¶
In [1]:
import findspark
findspark.init()
findspark.find()
Out[1]:
'/Users/heyunan/opt/anaconda3/lib/python3.9/site-packages/pyspark'
Instantiate a spark session¶
In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SpamClassifier').getOrCreate()
24/12/11 15:58:48 WARN Utils: Your hostname, s-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 10.201.147.28 instead (on interface en0)
24/12/11 15:58:48 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/11 15:58:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Load text data¶
In [3]:
# Load data and rename column
df = (
    spark.read
    .option("header", "false")
    .option("delimiter", "\t")
    .option("inferSchema", "true")
    .csv("SMSSpamCollection.txt")
    .withColumnRenamed("_c0", "label_string")
    .withColumnRenamed("_c1", "sms")
)
df.limit(10).show()
+------------+--------------------+
|label_string|                 sms|
+------------+--------------------+
|         ham|Go until jurong p...|
|         ham|Ok lar... Joking ...|
|        spam|Free entry in 2 a...|
|         ham|U dun say so earl...|
|         ham|Nah I don't think...|
|        spam|FreeMsg Hey there...|
|         ham|Even my brother i...|
|         ham|As per your reque...|
|        spam|WINNER!! As a val...|
|        spam|Had your mobile 1...|
+------------+--------------------+
24/12/11 15:59:05 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
Task: use this dataset to train a naive Bayes classifier that decides (classifies) whether a message is spam or a normal message (ham).¶
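One possible way to approach the task is sketched below. It is not the notebook's reference solution. It assumes the DataFrame has the columns `label_string` and `sms` as loaded above; the function names, the tokenizer pattern, the 80/20 split, the seed, the smoothing value, and the choice of accuracy as the metric are all illustrative assumptions. The sketch chains `StringIndexer`, `RegexTokenizer`, `StopWordsRemover`, and `CountVectorizer` into a `Pipeline` that ends in a multinomial `NaiveBayes` stage.

```python
def build_spam_pipeline():
    """Assemble a bag-of-words pipeline ending in multinomial naive Bayes."""
    # Imported inside the function so the definitions load before Spark is set up.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.feature import (
        CountVectorizer,
        RegexTokenizer,
        StopWordsRemover,
        StringIndexer,
    )

    stages = [
        # Map "ham"/"spam" to numeric labels; by default the most
        # frequent class receives index 0.0.
        StringIndexer(inputCol="label_string", outputCol="label"),
        # Lower-case the text and split on runs of non-word characters
        # (the pattern is an assumption, not taken from the notebook).
        RegexTokenizer(inputCol="sms", outputCol="tokens", pattern="\\W+"),
        # Drop common English stop words.
        StopWordsRemover(inputCol="tokens", outputCol="filtered"),
        # Term counts are non-negative, as multinomial naive Bayes requires.
        CountVectorizer(inputCol="filtered", outputCol="features"),
        NaiveBayes(
            featuresCol="features",
            labelCol="label",
            modelType="multinomial",
            smoothing=1.0,  # Laplace smoothing; value chosen for illustration
        ),
    ]
    return Pipeline(stages=stages)


def train_and_evaluate(df, seed=42):
    """Fit the pipeline on 80% of df and return (model, test accuracy)."""
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    # Defensive step: rows with a missing label or text would make
    # StringIndexer or the tokenizer fail.
    clean = df.dropna(subset=["label_string", "sms"])
    train, test = clean.randomSplit([0.8, 0.2], seed=seed)

    model = build_spam_pipeline().fit(train)
    predictions = model.transform(test)

    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy"
    )
    return model, evaluator.evaluate(predictions)
```

With the `df` loaded earlier, calling `model, accuracy = train_and_evaluate(df)` would fit the pipeline and return the held-out accuracy. Because ham messages are much more frequent than spam in this kind of data, accuracy alone can look flattering; switching `metricName` to `"f1"` is a reasonable variation.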
In [ ]: