Roy Hung


Generating Singapore News Headlines with a Recurrent Neural Network

  • #Tensorflow #Flask #Python


In the field of natural language processing, recurrent neural networks (RNNs) have been shown to produce surprisingly intuitive results with even the most basic architectures. RNNs can be fascinating and spooky at the same time, as the output generated from one may give the impression that the neural network holds some semblance of intelligence, when in fact it is simply a highly effective pattern recognition machine. As a result, a litany of applications using RNNs (from writing Shakespearean plays to creating Dr. Seuss-style rhythmic verses) has been spawned to explore and demonstrate their unreasonable effectiveness.

Here, I will add to the collection of text-generating RNNs with one trained to create news headlines in the style of the Straits Times newspaper from Singapore. You can generate some headlines yourself using the trained network by clicking the Generate Headlines button above. You can also find my source code here.

Headlines Dataset

The raw dataset contains 500,597 URLs from the Straits Times Singapore newspaper site. Like many vendored applications, the Straits Times website (built on Drupal) follows a standard application structure and hosts its sitemap file publicly at https://straitstimes.com/sitemap.xml. From there, it is fairly quick and straightforward to scrape all URLs to form the dataset.

### Please Scrape Responsibly ###

import datetime
import re

from selenium import webdriver

# Alter following code to use a different webdriver
# Change path to your own chromedriver version
driver = webdriver.Chrome('../bin/chromedriver')

sitemap_url = "https://www.straitstimes.com/sitemap.xml"
driver.get(sitemap_url)
sitemap_info = driver.find_elements_by_xpath("//div[@id='information']")[0].text
n_pages = int(re.findall(r'\d+', sitemap_info)[0])

today = datetime.datetime.today().strftime("%Y%m%d-%H%M%S")

with open('../data/st_sitemap_{}.csv'.format(today), 'w') as f:
    headers = "url,page"
    f.write(headers)
    f.write("\n")
    for page in range(1, n_pages + 1):
        sitemap_page = sitemap_url + "?page=" + str(page)
        driver.get(sitemap_page)
        # Extract article URLs from the sitemap table
        anchors = re.findall(r'<td><a href="(.*)" ', driver.page_source)
        print("Sitemap Page: {}".format(page))
        for a in anchors:
            f.write("{},{}\n".format(a, page))

For SEO purposes, most news agencies "slugify" their news article headlines and include them in the URL path. Take, for example, the news headline:

"New measures to help employers retain maids for a longer period"

The corresponding URL path turns out to be:

/singapore/manpower/new-measures-to-help-employers-retain-maids-for-a-longer-period

Obtaining the headlines dataset is a matter of cleaning the scraped URLs:

import pandas as pd

df = pd.read_csv("../data/st_sitemap_XXXX.csv")  # timestamped file from the scrape above

def clean_headline(url):
    '''Generate headlines from URL
    '''

    headline = url.split('/')[-1]
    headline = headline.split('-')
    # Caps first letter of every word
    headline = [ w[0].upper() + w[1:] if len(w) > 0 else w for w in headline ] 
    headline = ' '.join(headline)

    return headline

def clean_category(url):
    '''Generate categories from URL
    '''

    category = url.replace('https://www.straitstimes.com/','').split('/')[:-1]
    # Caps first letter of every word
    category = [ w[0].upper() + w[1:] if len(w) > 0 else w for w in category ] 
    category = '/'.join(category)

    return category

df['Headline'] = df['url'].apply(clean_headline)
df['Category'] = df['url'].apply(clean_category)

| Headline | Category | Wordcount | Charcount |
| --- | --- | --- | --- |
| World Briefs Forensic Firm Helping Fbi Unlock Iphone | World | 8 | 52 |
| Colombia Arrests 11 In Sex Trafficking Ring | World/Americas | 7 | 43 |
| Government Protesters Gear Up For Protests With Hong Kong Set To Mark Handover | Asia/East-asia | 13 | 78 |
| Indonesia To Meet Asian Nations Over Maid Welfare | Asia/Se-asia | 8 | 49 |
| Exclusive Titles A Breath Of Fresh Air | Tech/Games-apps | 7 | 38 |
Table 1 - Sample of cleaned dataset

After removing non-headline links and repetitive headlines (e.g., "Top News This Week"), we get a cleaned dataset ready for some exploratory data analysis. You can access the full cleaning code in this notebook.
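For illustration, the filtering might look something like the sketch below; the URL pattern and boilerplate headline here are hypothetical stand-ins, and the actual rules are in the notebook.

# Hypothetical filtering rules -- the real ones live in the cleaning notebook
mask_non_headline = df['url'].str.contains(r'/multimedia/|/tags/', na=False)
df = df[~mask_non_headline]

# Drop recurring boilerplate headlines and exact duplicates
df = df[df['Headline'] != 'Top News This Week']
df = df.drop_duplicates(subset='Headline').reset_index(drop=True)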

Exploring the Dataset

We present some summary statistics for the word and character counts, with and without stopwords.

Fig. 1 - Histogram of Word and Character Count

The stopwords corpus is obtained from the nltk library:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

The stopword density in each headline averages 22 percent, which is at the lower end of the typical 20-30 percent density commonly found in English text. This is unsurprising considering that headlines tend to be written in a condensed manner that packs more information into a shorter sentence, leaving little room for unneeded stopwords (e.g., pronouns).
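As a rough sketch, the density figure can be reproduced from the cleaned dataframe, using the stop_words set defined above:

# Per-headline stopword density, averaged across the dataset
tokens = df['Headline'].str.lower().str.split()
density = tokens.apply(
    lambda t: sum(w in stop_words for w in t) / len(t) if t else 0
)
print(density.mean())  # ~0.22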

Fig. 2 - Histogram of Word and Character Count, Excluding Stopwords

Exploring the dataset by category, there are a total of 70 categories (including sub-categories), with 'Singapore' news being the largest, taking up 12 percent of all articles.

Fig. 3 - Straits Times news categories by article count
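The breakdown behind Fig. 3 is a short pandas computation; a sketch, assuming the dataframe from the cleaning step:

# Category distribution by article count
category_counts = df['Category'].value_counts()
print(category_counts.size)                    # 70 categories and sub-categories
print(category_counts['Singapore'] / len(df))  # ~0.12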

Diving into each category, we look at the most common words that characterize the top 25 news segments. Click the sub-category titles below to see the corresponding word cloud; a sketch of how these clouds can be generated follows the figure.

  • Singapore
  • Singapore/Courts Crime
  • Singapore/Transport
  • Asia
  • Asia/East Asia
  • Asia/SE Asia
  • Asia/South Asia
  • Asia/Australia NZ
  • World
  • World/Middle East
  • World/United States
  • Business
  • Business/Economy
  • Business/Companies Markets
  • Lifestyle
  • Lifestyle/Arts
  • Lifestyle/Entertainment
  • Lifestyle/Food
  • Forum Letters In Print
  • Opinion
  • Politics
  • Sport
  • Sport/Football
  • Sport/Tennis
  • World/Europe

Fig. 4 - Most common words in headlines, by category
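A word cloud for a single category can be sketched with the wordcloud package; the styling parameters below are illustrative, not the ones used for the figures above.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Headlines from one sub-category (category string as produced by clean_category)
subset = df.loc[df['Category'] == 'Singapore/Courts-crime', 'Headline']
text = ' '.join(subset).lower()

wc = WordCloud(stopwords=stop_words, background_color='white',
               width=800, height=400).generate(text)

plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()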

While most top keywords in the categories are quite predictable, one interesting finding is that the most common word in the crime news segment turns out to be 'man', which is more than twice as likely to appear as the word 'woman'. How would this finding look juxtaposed against officially reported crime rates by gender in Singapore?

Details and code for the exploratory data analysis and wordclouds can be found in this notebook.

Training

For this particular version of the model deployed on this page, I used a simple RNN architecture: a single gated recurrent unit (GRU) hidden layer with 1,024 units, trained on a c5.large CPU instance for 3 epochs. For faster results, you can train on a GPU and replace the GRU layer with a CuDNNGRU layer. Using a p2.xlarge instance (AWS Deep Learning AMI) for GPU training, overall training took around an hour. A 2-layer CuDNNGRU network trained for 10 epochs gives better results (i.e., fewer grammatical mistakes).
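The input pipeline isn't shown below, but the model is trained at the character level following the standard TensorFlow text-generation recipe. Roughly, and assuming text holds all the headlines joined by newlines:

import numpy as np
import tensorflow as tf

# Map each unique character to an integer index
vocab = sorted(set(text))
char2idx = {c: i for i, c in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in text])

seq_length = 100  # characters per training example
sequences = tf.data.Dataset.from_tensor_slices(text_as_int) \
                           .batch(seq_length + 1, drop_remainder=True)

def split_input_target(chunk):
    # Target is the input shifted one character to the left
    return chunk[:-1], chunk[1:]

dataset = sequences.map(split_input_target) \
                   .shuffle(10000) \
                   .batch(64, drop_remainder=True)

With the dataset and vocabulary in hand, the model itself is a thin Keras wrapper: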

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    '''
    Wrapper function to build keras TF RNN model.
    
    Parameters
    ----------
    vocab_size : int
        Character-level training. vocab_size refers to the number
        of unique characters found in training text.
    embedding_dim : int
        Output dimension of dense vector for embedding layer
    rnn_units : int
        Dimensionality of output space
    batch_size : int
        Batch size, # of samples processed before model is 
        updated
    
    Returns
    -------
    obj
        Returns a tensorflow model object
    '''
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(
            vocab_size, 
            embedding_dim,
            batch_input_shape=[batch_size, None]
        ),
        tf.keras.layers.GRU(
            units=rnn_units,
            return_sequences=True,
            recurrent_initializer='glorot_uniform',
            stateful=True,
            recurrent_activation='sigmoid',
        ),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

model = build_model(
    vocab_size=len(vocab),  # number of unique characters in the training text
    embedding_dim=256,
    rnn_units=1024,
    batch_size=64
)

def loss(labels, logits):
    # The Dense layer outputs raw logits, so compute cross-entropy from logits
    return tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)

model.compile(
    optimizer=tf.train.AdamOptimizer(),
    loss=loss
)

Despite having trained a more powerful network, the simpler CPU-trained model was deployed instead for cost reasons. As I recently discovered, to deploy a GPU-trained model, you need a GPU instance: you can't make predictions with GRU layers when the model was trained on CuDNNGRU layers (at least not without a great deal of hacking). Also, to speed up predictions so users don't have to wait too long for their headlines, a single-layer model was used. You can view the full training code here.
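For completeness, inference follows the usual recipe for stateful character models: rebuild the network with batch_size=1, load the trained weights, and sample one character at a time with a temperature knob. A minimal sketch, assuming eager execution and a hypothetical checkpoint directory:

# Rebuild for single-sequence inference; checkpoint path is hypothetical
model = build_model(len(vocab), embedding_dim=256, rnn_units=1024, batch_size=1)
model.load_weights(tf.train.latest_checkpoint('./checkpoints'))
model.build(tf.TensorShape([1, None]))

def generate_headline(model, start_string, num_chars=60, temperature=1.0):
    input_eval = tf.expand_dims([char2idx[c] for c in start_string], 0)
    generated = []
    model.reset_states()
    for _ in range(num_chars):
        predictions = model(input_eval)               # (1, seq_len, vocab_size)
        logits = predictions[:, -1, :] / temperature  # last timestep only
        predicted_id = tf.random.categorical(logits, num_samples=1)[0, 0].numpy()
        input_eval = tf.expand_dims([predicted_id], 0)
        generated.append(idx2char[predicted_id])
    return start_string + ''.join(generated)

print(generate_headline(model, start_string='Singapore '))

Lower temperatures give safer, more grammatical headlines; higher temperatures give stranger ones.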

Some Thoughts

This has been an amusing exercise, and the generated headlines have been pretty entertaining. Despite its simple architecture, the RNN did not disappoint, producing some genuinely intuitive and occasionally humorous results ("Obama Calls For Brexit"). However, it seems that beyond that, there is little more the vanilla RNN can do with this dataset; training for more epochs or with added hidden layers only improves its grammar. And of course, while there is no expectation that this network can generate truly meaningful headlines, its uncanny ability to mimic what might be considered meaningful will continue to fascinate me.