A Complete Course Project: News
1. Pick a topic of personal interest.
2. Scrape related data from the web.
3. Run text analysis on the data and generate a word cloud.
4. Explain and interpret the analysis results.
5. Write a complete blog post with the source code, the scraped data, and the analysis results, forming a presentable deliverable.
The topic chosen for this project is news; the source page is http://news.sina.com.cn/world/:
Scraping the data from the web:
import requests
from bs4 import BeautifulSoup

url = 'http://news.sina.com.cn/world/'
res = requests.get(url)
res.encoding = 'UTF-8'
soup = BeautifulSoup(res.text, 'html.parser')

# Each .news-item holds one headline; the h2 > a element carries the title and link
for news in soup.select('.news-item'):
    h2 = news.select('h2')
    if len(h2) > 0:
        time = news.select('.time')[0].text
        title = h2[0].text
        href = h2[0].select('a')[0]['href']
        print(title, time, href)
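The selector logic above can be exercised offline on a small HTML fragment, which also shows how the scraped titles can be written to the text file consumed by the word-cloud step. The markup below is a hypothetical stand-in for Sina's page structure, and the output path `world.txt` mirrors the `D:\world.txt` used later:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the structure of one Sina news item
html = '''
<div class="news-item">
  <h2><a href="http://news.sina.com.cn/w/example.shtml">Example headline</a></h2>
  <span class="time">10月1日 08:00</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
titles = []
for news in soup.select('.news-item'):
    h2 = news.select('h2')
    if len(h2) > 0:
        titles.append(h2[0].text.strip())

# Save the titles so the analysis step can read them back from a plain text file
with open('world.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(titles))
```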
Text analysis and word-cloud generation:
import jieba
import matplotlib.pyplot as plt
from os import path
from matplotlib.pyplot import imread  # scipy.misc.imread has been removed from SciPy
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Read the scraped news text and segment it with jieba (precise mode by default)
text = open('D:\\world.txt', encoding='utf-8').read()
wordlist = jieba.cut(text)  # cut_all=True would give full mode instead
wl_space_split = " ".join(wordlist)

d = path.dirname(__file__)

# Load the image whose shape and colors the word cloud will follow
nana_coloring = imread("D:\\1.jpg")

my_wordcloud = WordCloud(
    background_color='white',
    mask=nana_coloring,
    max_words=5000,
    stopwords=STOPWORDS,
    max_font_size=80,
    random_state=20,
    font_path='simhei.ttf',  # a Chinese font file is required to render Chinese words
)

# Generate the word cloud from the space-joined segmented text
my_wordcloud.generate(wl_space_split)

# Recolor the words using the colors of the mask image
image_colors = ImageColorGenerator(nana_coloring)
my_wordcloud.recolor(color_func=image_colors)

plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()

my_wordcloud.to_file(path.join(d, "cloudimg.png"))
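Before drawing the cloud, it can help to count which words dominate, since those are the ones the word cloud renders largest. The snippet below is a minimal sketch using `collections.Counter` on hypothetical sample tokens; the real input would be the jieba-segmented news text:

```python
from collections import Counter

# Hypothetical tokens standing in for the jieba-segmented contents of world.txt
sample = "特朗普 访问 特朗普 演讲 美国 特朗普".split()
freq = Counter(sample)

# The most frequent words will appear largest in the word cloud
print(freq.most_common(3))
```

A mapping like `freq` can also be passed directly to `WordCloud.generate_from_frequencies()` instead of generating from raw text.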
The generated word cloud:
As the word cloud shows, in international news, U.S. President Trump is the biggest focus of attention.