By Chris Musselle

This is the third post in a three-part series in which I have explored the options available for including both R and Python in a data analysis pipeline. See the first post for some reasons why you may wish to do this, and for details of a general strategy involving flat files. The second post expands on this by showing how R and Python processes can call each other and pass arguments between them.

In this post I will be sharing a longer example using these approaches in an analysis we carried out at Mango as a proof of concept to cluster news articles. The pipeline involved the use of both R and Python at different stages, with a Python script being called from R to fetch the data, and the exploratory analysis being conducted in R.

Full implementation details can be found in the project repository, though for brevity this article focuses on the core concepts, with the parts most relevant to R and Python integration discussed below.

Document Clustering

We were interested in the problem of document clustering of live published news articles, and specifically, wished to investigate times when multiple news websites were talking about the same content. As a first step towards this, we looked at sourcing live articles via RSS feeds, and used text mining methods to preprocess and cluster the articles based on their content.

Sourcing News Articles From RSS Feeds

There are some great Python tools out there for scraping and sourcing web data, and so for this task we used a combination of feedparser, requests, and BeautifulSoup to process the RSS feeds, fetch web content, and extract the parts we were interested in. The full script is in the repository, though the general code structure was as follows:

    # fetch_RSS_feed.py
    import sys

    def get_articles(feed_url, json_filename='articles.json'):
        """Update JSON file with articles from RSS feed"""
        #
        # See github link for full function script
        #

    if __name__ == '__main__':
        # Pass Arguments
        args = sys.argv[1:]
        feed_url = args[0]
        filepath = args[1]
        # Get the latest articles and append to the JSON file given
        get_articles(feed_url, filepath)

Here we can see that the get_articles function is defined to perform the bulk of the data sourcing tasks, and that the parameters passed to it are the positional arguments from the command line. Within get_articles, the URL, publication date, title and text content were extracted for each article in the RSS feed and stored in a JSON file. For each article, the text content was made up of the text inside all HTML paragraph tags within the news article.
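To make the paragraph extraction step more concrete, here is a minimal sketch of how the text inside paragraph tags might be collected into an article record. It is not the project's code: the original used BeautifulSoup, which I have swapped for the standard library's html.parser so the example runs without extra dependencies. The names ParagraphExtractor and build_article, the sample HTML, and the record keys other than "text" are assumptions made for illustration.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_p = False     # True while reading inside a <p> element
        self._chunks = []      # text pieces of the paragraph being read
        self.paragraphs = []   # finished paragraphs, in document order

    def _flush(self):
        # Join the pieces and normalise runs of whitespace to single spaces.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()      # a new <p> implicitly ends any open one
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def build_article(url, published, title, html):
    """Return one article record; the key names are illustrative assumptions."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "date": published,
        "title": title,
        "text": "\n".join(parser.paragraphs),
    }


# Made-up page content, used only to exercise the sketch.
sample_html = (
    "<html><body><h1>Floods</h1><p>Heavy <b>rain</b> fell.</p>"
    "<div>menu</div><p>Roads closed.</p></body></html>"
)
article = build_article("http://example.com/a", "2015-12-07", "Floods", sample_html)
print(article["text"])
```

Only the text inside the two paragraph tags ends up in the record; the heading and the navigation div are ignored, which is the behaviour described above.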

Sidenote: The if __name__ == "__main__": line may look strange to non-Python programmers, but it is a common way in Python scripts to control which sections of the code are run when the script is executed directly, as opposed to when it is imported by another Python script. If the script is executed directly (as is the case when it is called from R later), the if statement evaluates to true and all code is run. If, however, at some point in the future I wanted to reuse get_articles in another Python script, I could import that function from this script without triggering the code within the if statement.
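The difference between importing and executing can be seen with a small experiment. The sketch below writes a tiny stand-in module to a temporary directory, imports it, and then runs it as a script. The module contents and the name fetch_demo are hypothetical and are not the real fetch_RSS_feed.py.

```python
import importlib.util
import os
import subprocess
import sys
import tempfile

# A tiny stand-in for fetch_RSS_feed.py (hypothetical contents).
MODULE_SOURCE = '''
def get_articles(feed_url, json_filename="articles.json"):
    return "would fetch {} into {}".format(feed_url, json_filename)

if __name__ == "__main__":
    print("run as a script")
'''

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fetch_demo.py")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(MODULE_SOURCE)

    # Importing: __name__ is "fetch_demo", so the guarded block is skipped
    # and only the function definition is loaded.
    spec = importlib.util.spec_from_file_location("fetch_demo", path)
    fetch_demo = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(fetch_demo)
    imported_result = fetch_demo.get_articles("http://example.com/rss")

    # Executing directly: __name__ is "__main__", so the guarded block runs.
    script_output = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, check=True
    ).stdout.strip()

print(imported_result)
print(script_output)
```

The import gives access to get_articles without printing anything, while running the file directly triggers the guarded print call.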

The above Python script was then executed from within R by defining a small utility function that runs it as a system command. Note that by using stdout=TRUE, any messages printed to stdout with print() in the Python code can be captured and displayed in the R console.

Loading Data into R

Once the data had been written to a JSON file, the next job was to get it into R to be used with the tm package for text mining. This proved a little trickier than first expected however, as the tm package is mainly geared around reading in documents from raw text files, or directories containing multiple text files. To convert the JSON file into the expected VCorpus object for tm I used the following:

    # NOTE: the function name was lost from the source; load_json_file is a placeholder.
    # fromJSON() comes from an R JSON package; the source does not say which one.
    library(tm)

    load_json_file <- function(filepath) {
      # Load data from JSON
      json_file <- file(filepath, "rb", encoding = "UTF-8")
      json_obj <- fromJSON(json_file)
      close(json_file)

      # Convert to VCorpus
      bbc_texts <- lapply(json_obj, FUN = function(x) x$text)
      df <- as.data.frame(bbc_texts)
      df <- t(df)  # transpose so that each article becomes a row
      articles <- VCorpus(DataframeSource(df))
      articles
    }

Unicode Woes

One potential problem when manipulating text data from a variety of sources and passing it between languages is that you can easily get tripped up by character encoding errors en route. We found that by default Python was able to read in, process and write out the article content from the HTML sources, but R struggled to decode certain characters that were written to the resulting JSON file. This is because the two languages use or expect different character encodings by default.

To remedy this, you should be explicit about the encoding you are using when writing and reading a file, by specifying it when opening a file connection. In Python this meant passing the encoding when opening the JSON file for writing, and on the R side opening the file connection was as follows:


    # Load data from JSON
    json_file <- file(filepath, "rb", encoding = "UTF-8")
    json_obj <- fromJSON(json_file)
    close(json_file)

Here UTF-8 is chosen as it is a good default encoding to use, and is the most widely used encoding on the web.
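On the Python side, being explicit about the encoding when writing the JSON file might look like the sketch below. It assumes Python 3, where the built-in open() accepts an encoding argument. The sample records and the temporary file path are made up for illustration, and passing ensure_ascii=False is my own choice here rather than something taken from the project script.

```python
import json
import os
import tempfile

# Made-up sample records containing non-ASCII characters.
articles = [{"title": "Café reopens after flood", "text": "Naïve résumé, 10°C"}]

path = os.path.join(tempfile.mkdtemp(), "articles.json")

# State the encoding explicitly instead of relying on the platform default.
with open(path, "w", encoding="utf-8") as json_file:
    # ensure_ascii=False writes the characters themselves rather than \uXXXX escapes.
    json.dump(articles, json_file, ensure_ascii=False, indent=4)

# Read the file back with the same explicit encoding.
with open(path, "r", encoding="utf-8") as json_file:
    round_tripped = json.load(json_file)

print(round_tripped == articles)
```

With ensure_ascii left at its default of True, json would instead escape every non-ASCII character, which also sidesteps encoding mismatches at the cost of a less readable file.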

For more details on Unicode and ways of handling it in Python 2 and 3 .

Summary of Text Preprocessing and Analysis

The text preprocessing part of the analysis consisted of the following steps, which were all carried out using the tm package in R:


  • Tokenisation – Splitting text into words.
  • Punctuation and whitespace removal.
  • Conversion to lowercase.
  • Stemming – to consolidate different word endings.
  • Stopword removal – to ignore the most common and therefore least informative words.
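The analysis itself used the tm package in R for these steps, but the idea behind each one can be sketched in a few lines of Python. This is a toy illustration only: the stopword list is a tiny made-up sample, crude_stem is a naive suffix stripper standing in for a proper stemming algorithm, and the function names are hypothetical.

```python
import re

# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "are", "was", "were"}


def crude_stem(word):
    """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Return the cleaned list of tokens for one document."""
    text = text.lower()                     # conversion to lowercase
    text = re.sub(r"[^\w\s]", " ", text)    # punctuation removal
    tokens = text.split()                   # tokenisation, also drops extra whitespace
    # Stopwords are removed before stemming so that the list matches whole words.
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]  # stemming


tokens = preprocess("The rivers were flooding,  and roads flooded in Cumbria!")
print(tokens)
```

Note how "flooding" and "flooded" are consolidated to the same stem, which is the purpose of the stemming step.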

Once the text was cleaned and processed, the TF-IDF statistic was calculated for the collection of articles. This statistic aims to provide a measure of how important each word is for a particular document, across a collection of documents. It is more sophisticated than just using the word frequencies themselves, as it takes into account that some words naturally occur more frequently than others across all documents.
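As a rough illustration of the statistic, the sketch below computes one common variant: the weight of term t in document d is tf(t, d) * log2(N / df(t)), where tf is the term's share of the document's tokens, N is the number of documents, and df(t) is the number of documents containing t. Implementations differ in their exact weighting and normalisation, so this should not be read as the formula used in the analysis. The function name and sample documents are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log2(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up tokenised documents.
docs = [
    ["flood", "cumbria", "rain"],
    ["flood", "rescue", "cumbria"],
    ["election", "vote", "flood"],
]
scores = tf_idf(docs)
for term, weight in sorted(scores[0].items()):
    print(term, round(weight, 3))
```

Here "flood" appears in every document, so its weight is zero everywhere, while "rain" appears in only one document and receives the highest weight in the first document. This is the sense in which the statistic discounts words that are common across the whole collection.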

Finally, a distance matrix was constructed based on the TF-IDF values and hierarchical clustering was performed. The results were then visualised as a dendrogram using the dendextend package in R.
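The clustering step can be sketched in plain Python as follows. The article does not state which distance measure or linkage method was used, so cosine distance and average linkage are assumptions on my part, and the naive algorithm below is only suitable for small examples. The vectors are made-up TF-IDF style weights.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def distance_matrix(vectors):
    """Full pairwise distance matrix as a list of lists."""
    n = len(vectors)
    return [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


def hierarchical_clustering(dist):
    """Naive agglomerative clustering with average linkage.

    Returns the merges in order as (members_a, members_b, distance) tuples.
    """
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Average of all pairwise distances between the two clusters.
                pairs = [dist[i][j] for i in clusters[x] for j in clusters[y]]
                linkage = sum(pairs) / len(pairs)
                if best is None or linkage < best[0]:
                    best = (linkage, x, y)
        linkage, x, y = best
        merges.append((clusters[x], clusters[y], linkage))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges


# Made-up weight vectors for three articles.
vectors = [
    {"cumbria": 0.9, "rain": 0.4},     # article 0
    {"cumbria": 0.8, "rescue": 0.5},   # article 1
    {"election": 1.0, "vote": 0.7},    # article 2
]
merges = hierarchical_clustering(distance_matrix(vectors))
for members_a, members_b, height in merges:
    print(members_a, members_b, round(height, 3))
```

The sequence of merges and their heights is the information a dendrogram displays: articles 0 and 1 share a term and join first at a low height, and the unrelated article 2 joins last.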

An example of the clusters formed from 475 articles published over the last 4 days is shown below, where the leaf nodes are coloured according to their source, with blue corresponding to BBC News, green to The Guardian, and indigo to The Independent.


It is interesting here to see articles from the same news websites occasionally forming groups, suggesting that news websites often post multiple articles with similar content, which is plausible considering how news stories unfold over time.


What’s more interesting is finding clusters where multiple news websites are talking about similar things. Below is one such cluster with the article headlines displayed, which mostly relate to the recent flooding in Cumbria.

Hierarchical clustering is often a useful step in exploratory data analysis, and this work gives some insight into what is possible with news article clustering from live RSS feeds. Future work will look to evaluate different clustering approaches in more detail by examining the quality of the clusters they produce.

Other Approaches

In this series we have focused on describing the simplest approach of using flat files as an intermediate storage medium between the two languages. However, it is worth briefly mentioning several other options that are available, such as:

  • Using a database as a medium of storage instead of flat files.
  • Passing the results of a script execution in memory instead of writing to an intermediate file.
  • Running two persistent R and Python processes at once, and passing data between them. Dedicated interface libraries provide one such way of doing this.

Each of these methods brings with it some additional pros and cons, and so the question of which is most suitable is often dependent on the application itself. As a first port of call though, using common flat file formats is a good place to start.
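As an illustration of the second option listed above, the sketch below has a child process print its result as JSON to stdout, with the parent parsing it directly rather than reading an intermediate file. Both sides are Python here purely to keep the example self-contained. The child program and its data are hypothetical.

```python
import json
import subprocess
import sys

# Stand-in for a separate script: it prints its result as JSON to stdout
# instead of writing an intermediate file (hypothetical child program).
CHILD_CODE = """
import json
articles = [{"title": "Example headline", "text": "Example body"}]
print(json.dumps(articles))
"""

# Run the child with the same Python interpreter and capture its stdout.
completed = subprocess.run(
    [sys.executable, "-c", CHILD_CODE],
    capture_output=True, text=True, check=True,
)

# The captured text is parsed straight into Python objects.
articles = json.loads(completed.stdout)
print(articles[0]["title"])
```

Because json.dumps escapes non-ASCII characters by default, the text passed through the pipe is plain ASCII, which avoids the encoding pitfalls discussed earlier. The trade-off is that nothing is left on disk to inspect or rerun from if a later stage fails.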

Summary

This post gave an extended example of how Mango have been using both Python and R to perform exploratory analysis around clustering news articles. We used the flat file air gap strategy described in the first post in this series, and then automated the calling of Python from R by spawning a separate subprocess (described in the second post). As can be seen, with a bit of care around character encodings this provides a straightforward approach to “bridging the language gap”, and allows multiple skillsets to be utilised when performing a piece of analysis.




