Machine learning: How to create a recommendation engine

In this excerpt from the book “Pragmatic AI,” learn how to code machine learning-based recommendation engines on AWS, Azure, and Google Cloud


What do Russian trolls, Facebook, and US elections have to do with machine learning? Recommendation engines are at the heart of the central feedback loop of social networks and the user-generated content (UGC) they create. Users join the network and are recommended users and content with which to engage. Recommendation engines can be gamed because they amplify the effects of thought bubbles. The 2016 US presidential election showed how important it is to understand how recommendation engines work and the limitations and strengths they offer.

AI-based systems aren’t a panacea that only creates good things; rather, they offer a set of capabilities. It can be incredibly useful to get an appropriate product recommendation on a shopping site, but it can be equally frustrating to get recommended content that later turns out to be fake (perhaps generated by a foreign power motivated to sow discord in your country).

This chapter covers recommendation engines and natural language processing (NLP), both from a high level and a coding level. It also gives examples of how to use frameworks such as the Python-based recommendation engine Surprise, as well as instructions on how to build your own. Some of the topics covered include the Netflix prize, singular-value decomposition (SVD), collaborative filtering, real-world problems with recommendation engines, NLP, and production sentiment analysis using cloud APIs.

The Netflix prize wasn’t implemented in production

Before “data science” was a common term and Kaggle was around, the Netflix prize took the world by storm. The Netflix prize was a contest created to improve the recommendation of new movies. Many of the original ideas from the contest later inspired other companies and products. Creating a $1 million data science contest back in 2006 sparked excitement that would foreshadow the current age of AI. In 2006, ironically, the age of cloud computing also began, with the launch of Amazon EC2.

The cloud and the dawn of widespread AI have been intertwined. Netflix also has been one of the biggest users of the public cloud via Amazon Web Services. Despite all these interesting historical footnotes, the Netflix prize-winning algorithm was never implemented in production. The winners in 2009, the BellKor’s Pragmatic Chaos team, achieved a greater than 10 percent improvement, with a test RMSE of 0.867. The team’s paper describes the solution as a linear blend of more than 100 results. A quote in the paper that is particularly relevant is “A lesson here is that having lots of models is useful for the incremental results needed to win competitions, but practically, excellent systems can be built with just a few well-selected models.”

The winning approach for the Netflix competition was not implemented in production at Netflix because the engineering complexity was deemed too great when compared with the gains it produced. A core algorithm used in recommendations, SVD, is also expensive at scale; as noted in “Fast SVD for Large-Scale Matrices,” “though feasible for small data sets or offline processing, many modern applications involve real-time learning and/or massive data set dimensionality and size.” In practice, this is one of the huge challenges of production machine learning: the time and computational resources necessary to produce results.

I had a similar experience building recommendation engines at companies. When an algorithm is run in a batch manner and is simple, it can generate useful recommendations. But if a more complex approach is taken, or if the requirements go from batch to real time, the complexity of putting it into production and/or maintaining it explodes. The lesson here is that simpler is better: choosing batch-based machine learning over real time, choosing a simple model over an ensemble of multiple techniques, and deciding whether it makes more sense to call a recommendation engine API than to create the solution yourself.

Key concepts in recommendation systems

Figure 1 shows a social network recommendation feedback loop. The more users a system has, the more content it creates. The more content that is created, the more recommendations it creates for new content. This feedback loop, in turn, drives more users and more content. As mentioned at the beginning of this chapter, these capabilities can be used for both positive and negative features of a platform.


Figure 1: The social network recommendation feedback loop

Using the Surprise framework in Python

One way to explore the concepts behind recommendation engines is to use the Surprise framework. A few of the handy things about the framework are that it has built-in data sets (MovieLens and Jester), it includes SVD and other common algorithms, including similarity measures, and it includes tools to evaluate the performance of recommendations in the form of root mean squared error (RMSE) and mean absolute error (MAE), as well as the time it took to train the model.
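As a quick illustration of those evaluation tools, here is a minimal sketch (not part of the book’s example) that cross-validates the built-in SVD algorithm on the bundled MovieLens data set and reports RMSE and MAE; cross_validate is the evaluation entry point in Surprise 1.x:

# Minimal sketch: 5-fold cross-validation of SVD on MovieLens 100k,
# reporting RMSE, MAE, and fit/test times for each fold.
from surprise import SVD, Dataset
from surprise.model_selection import cross_validate

data = Dataset.load_builtin('ml-100k')  # downloads the data set on first use
cross_validate(SVD(), data, measures=['RMSE', 'MAE'], cv=5, verbose=True)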

Here is an example of how it can be used in a pseudo-production situation by tweaking one of the provided examples.

First are the necessary imports to get the library loaded:

In [2]: import io
   ...: from surprise import KNNBaseline
   ...: from surprise import Dataset
   ...: from surprise import get_dataset_dir
   ...: import pandas as pd

A helper function is created to convert IDs to names:

In [3]: def read_item_names():
   ...:     """Read the u.item file from the MovieLens 100-k dataset
   ...:     and return two mappings to convert raw ids into movie
   ...:     names and movie names into raw ids.
   ...:     """
   ...:
   ...:     file_name = get_dataset_dir() + '/ml-100k/ml-100k/u.item'
   ...:     rid_to_name = {}
   ...:     name_to_rid = {}
   ...:     with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
   ...:         for line in f:
   ...:             line = line.split('|')
   ...:             rid_to_name[line[0]] = line[1]
   ...:             name_to_rid[line[1]] = line[0]
   ...:
   ...:     return rid_to_name, name_to_rid

Similarities are computed between items:

In [4]: # First, train the algorithm
   ...: # to compute the similarities between items
   ...: data = Dataset.load_builtin('ml-100k')
   ...: trainset = data.build_full_trainset()
   ...: sim_options = {'name': 'pearson_baseline',
   ...:                'user_based': False}
   ...: algo = KNNBaseline(sim_options=sim_options)
   ...: algo.fit(trainset)
   ...:
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Out[4]: <surprise.prediction_algorithms.knns.KNNBaseline>

Finally, 10 recommendations are provided, in a fashion similar to another example in this chapter:

In [5]: rid_to_name, name_to_rid = read_item_names()
   ...:
   ...: # Retrieve inner id of the movie Toy Story
   ...: toy_story_raw_id = name_to_rid['Toy Story (1995)']
   ...: toy_story_inner_id = algo.trainset.to_inner_iid(
   ...:     toy_story_raw_id)
   ...:
   ...: # Retrieve inner ids of the nearest neighbors of Toy Story.
   ...: toy_story_neighbors = algo.get_neighbors(
   ...:     toy_story_inner_id, k=10)
   ...:
   ...: # Convert inner ids of the neighbors into names.
   ...: toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
   ...:                        for inner_id in toy_story_neighbors)
   ...: toy_story_neighbors = (rid_to_name[rid]
   ...:                        for rid in toy_story_neighbors)
   ...:
   ...: print('The 10 nearest neighbors of Toy Story are:')
   ...: for movie in toy_story_neighbors:
   ...:     print(movie)
   ...:

The 10 nearest neighbors of Toy Story are:
Beauty and the Beast (1991)
Raiders of the Lost Ark (1981)
That Thing You Do! (1996)
Lion King, The (1994)
Craft, The (1996)
Liar Liar (1997)
Aladdin (1992)
Cool Hand Luke (1967)
Winnie the Pooh and the Blustery Day (1968)
Indiana Jones and the Last Crusade (1989)

In exploring this example, consider the real-world issues you would face implementing it in production. Here is an example of a pseudocode API function that someone at your company may be asked to produce:

def recommendations(movies, rec_count):
    """Take a list of movies a user likes and
    return rec_count recommendations"""

movies = ["Beauty and the Beast (1991)", "Cool Hand Luke (1967)",.. ]
print(recommendations(movies=movies, rec_count=10))
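One way to fill in that function, as a minimal sketch (not the book’s solution), is to reuse the fitted KNNBaseline model and the id/name mappings from the Surprise example above, pool the neighbors of every seed movie, and rank candidates by how often they appear:

# Minimal sketch, assuming the fitted `algo` (KNNBaseline) and the
# rid_to_name/name_to_rid mappings created earlier. Pooling neighbors
# and ranking by vote count is just one possible aggregation strategy.
from collections import Counter

def recommendations(movies, rec_count):
    """Take a list of movie names and return rec_count similar movies."""
    votes = Counter()
    for name in movies:
        inner_id = algo.trainset.to_inner_iid(name_to_rid[name])
        for neighbor_id in algo.get_neighbors(inner_id, k=rec_count):
            votes[algo.trainset.to_raw_iid(neighbor_id)] += 1
    # Never recommend the seed movies themselves
    seeds = {name_to_rid[name] for name in movies}
    ranked = [rid for rid, _ in votes.most_common() if rid not in seeds]
    return [rid_to_name[rid] for rid in ranked[:rec_count]]

movies = ["Beauty and the Beast (1991)", "Cool Hand Luke (1967)"]
print(recommendations(movies=movies, rec_count=10))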

Some questions to ask in implementing this are: What trade-offs are you making in picking the top recommendations from a group of selections versus from just one movie? How well will this algorithm perform on a very large data set? There are no right answers, but these are things you should think about as you deploy recommendation engines into production.

Cloud solutions to recommendation systems

The Google Cloud Platform has an example of using machine learning on Compute Engine to make product recommendations that is worth exploring. In the example, PySpark and the ALS algorithm are used along with Cloud SQL. Amazon also has an example of how to build a recommendation engine using its platform, Spark, and Elastic MapReduce (EMR).

In both cases, Spark is used to increase the performance of the algorithm by dividing the computation across a cluster of machines. Finally, AWS is heavily pushing SageMaker, which can do distributed Spark jobs natively or talk to an EMR cluster.

Real-world production issues with recommendations

Most books and articles on recommendation focus purely on the technical aspects of recommendation systems. This book is about pragmatism, and so there are some issues to talk about when it comes to recommendation systems. A few of these topics are covered in this section: performance, ETL, user experience (UX), and shills/bots.

One of the most popular algorithms, as discussed, is O(n_samples^2 * n_features), or quadratic. (With quadratic scaling, going from 10,000 to 100,000 samples makes training roughly 100 times slower.) This means that it is very difficult to train a model in real time and get an optimal solution. Therefore, training a recommendation system will need to occur as a batch job in most cases, unless tricks like using a greedy heuristic and/or creating recommendations only for a small subset of active users, popular products, etc., are employed.

When I created a user-follow recommendation system from scratch for a social network, I found many of these issues came front and center. Training the model took hours, so the only realistic solution was to run it nightly. Additionally, I later created an in-memory copy of our training data, so the algorithm was bound only by CPU, not I/O.

Performance is a nontrivial concern in creating a production recommendation system, in both the short term and the long term. It is possible that the approach you initially use will not scale as your company grows its users and products. Perhaps a Jupyter Notebook, Pandas, and scikit-learn were acceptable when you had 10,000 users on your platform, but that may quickly turn out not to be a scalable solution.

Instead, a PySpark-based support vector machine training algorithm may dramatically improve performance and decrease maintenance time. And then later, again, you may need to switch to dedicated machine learning chips like TPU or the Nvidia Volta. Having the ability to plan for this capacity while still making initial working solutions is a critical skill to have to implement pragmatic AI solutions that actually make it to production.

Real-world recommendation problems: Integration with production APIs

I found that many real-world problems surface in production at startups that build recommendations. These are problems that are not as heavily discussed in machine learning books. One such problem is the cold-start problem. In the examples using the Surprise framework, there is already a massive database of “correct answers.” In the real world, you may have so few users or products that it doesn’t make sense to train a model. What can you do?

A decent solution is to make the recommendation engine follow a three-phase path. For phase one, take the most popular users, content, or products and serve those out as recommendations. As more UGC is created on the platform, for phase two, use similarity scoring (without training a model). Here is some hand-coded code I have used in production a couple of different times that did just that. First is a Tanimoto score, which is the Jaccard distance by another name.

"""Data Science Algorithms"""
def tanimoto(list1, list2):
   """tanimoto coefficient
   In [2]: list2=['39229', '31995', '32015']    In [3]: list1=['31936', '35989', '27489',         '39229', '15468', '31993', '26478']    In [4]: tanimoto(list1,list2)    Out[4]: 0.1111111111111111
   Uses intersection of two sets to determine numerical score    """
   intersection = set(list1).intersection(set(list2))    return float(len(intersection))/(len(list1)) +\       len(list2) - len(intersection)

Next is HBD: Here Be Dragons. Follower relationships are downloaded and converted into a Pandas DataFrame.

import os

import pandas as pd

from .algorithms import tanimoto

def follows_dataframe(path=None):
    """Creates Follows Dataframe"""

    if not path:
        path = os.path.join(os.getenv('PYTHONPATH'),
            'ext', 'follows.csv')
    df = pd.read_csv(path)
    return df

def follower_statistics(df):
    """Returns counts of follower behavior

    In [15]: follow_counts.head()
    Out[15]:
    followerId
    581bea20-962c-11e5-8c10-0242528e2f1b    1558
    74d96701-e82b-11e4-b88d-068394965ab2      94
    d3ea2a10-e81a-11e4-9090-0242528e2f1b      93
    0ed9aef0-f029-11e4-82f0-0aa89fecadc       88
    55d31000-1b74-11e5-b730-0680a328ea36      64
    Name: followingId, dtype: int64
    """

    follow_counts = df.groupby(['followerId'])['followingId'].\
        count().sort_values(ascending=False)
    return follow_counts

def follow_metadata_statistics(df):
    """Generates metadata about follower behavior

    In [13]: df_metadata.describe()
    Out[13]:
    count    2145.000000
    mean        3.276923
    std        33.961413
    min         1.000000
    25%         1.000000
    50%         1.000000
    75%         3.000000
    max      1558.000000
    Name: followingId, dtype: float64
    """

    dfs = follower_statistics(df)
    df_metadata = dfs.describe()
    return df_metadata

def follow_relations_df(df):
    """Returns a dataframe of follower with all relations"""

    df = df.groupby('followerId').followingId.apply(list)
    dfr = df.to_frame("follow_relations")
    dfr.reset_index(level=0, inplace=True)
    return dfr

def simple_score(column, followers):
    """Used as an apply function for dataframe"""

    return tanimoto(column, followers)

def get_followers_by_id(dfr, followerId):
    """Returns a list of followers by followerID"""

    followers = dfr.loc[dfr['followerId'] == followerId]
    fr = followers['follow_relations']
    return fr.tolist()[0]

def generate_similarity_scores(dfr, followerId,
        limit=10, threshold=.1):
    """Generates a list of recommendations for a followerID"""

    followers = get_followers_by_id(dfr, followerId)
    recs = dfr['follow_relations'].\
        apply(simple_score, args=(followers,)).\
        where(dfr > threshold).dropna().sort_values()[-limit:]
    filters_recs = recs.where(recs > threshold)
    return filters_recs

def return_similarity_scores_with_ids(dfr, scores):
    """Returns Scores and FollowerID"""

    dfs = pd.DataFrame(dfr, index=scores.index.tolist())
    dfs['scores'] = scores[dfs.index]
    dfs['following_count'] = dfs['follow_relations'].apply(len)
    return dfs

To use this API, you would engage with it by following this sequence:

In [1]: from follows import *
In [2]: df = follows_dataframe()
In [3]: dfr = follow_relations_df(df)
In [4]: dfr.head()
In [5]: scores = generate_similarity_scores(dfr,
   ...:     "00480160-0e6a-11e6-b5a1-06f8ea4c790f")

In [5]: scores
Out[5]:
2144    0.000000
713     0.000000
714     0.000000
715     0.000000
716     0.000000
717     0.000000
712     0.000000
980     0.333333
2057    0.333333
3       1.000000
Name: follow_relations, dtype: float64
In [6]: dfs = return_similarity_scores_with_ids(dfr, scores)
In [6]: dfs
Out[6]:
                                followerId  \
980   76cce300-0e6a-11e6-83e2-0242528e2f1b
2057  f5ccbf50-0e69-11e6-b5a1-06f8ea4c790f
3     00480160-0e6a-11e6-b5a1-06f8ea4c790f

                                       follow_relations    scores  \
980   [f5ccbf50-0e69-11e6-b5a1-06f8ea4c790f, 0048016...  0.333333
2057  [76cce300-0e6a-11e6-83e2-0242528e2f1b, 0048016...  0.333333
3     [f5ccbf50-0e69-11e6-b5a1-068ea4c790f, 76cce30...          1

      following_count
980                 2
2057                2
3                  32

With the current implementation, this “phase 2” similarity score-based recommendation would need to be run as a batch API. Additionally, Pandas will eventually run into performance problems at scale; ditching it at some point for either PySpark or Pandas on Ray is going to be a good move.

For “phase 3,” it is finally time to pull out the big guns and use something like Surprise and/or PySpark to train an SVD-based model and figure out model accuracy. In the first part of your company’s history, though, why bother, when there is little to no value in doing formal machine learning model training?
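For illustration, a minimal phase-three sketch with Surprise (assuming its standard model_selection utilities) might hold out a test set, fit an SVD model, and report RMSE:

# Minimal "phase 3" sketch: hold out 20 percent of the ratings,
# train an SVD model, and measure accuracy on the held-out set.
from surprise import SVD, Dataset, accuracy
from surprise.model_selection import train_test_split

data = Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=0.2)

algo = SVD()
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)  # prints and returns the RMSE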

Another production API issue is how to deal with rejected recommendations. There is nothing more irritating to users than to keep getting recommendations for things they don’t want or already have. So, yet another sticky production issue needs to be solved. Ideally, the user is given the ability to click “do not show again” for a recommendation; otherwise, your recommendation engine quickly becomes garbage. Additionally, the user is telling you something, so why not take that signal and feed it back into your recommendation engine model?
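Here is a purely hypothetical sketch of that feedback loop; the rejected_items store and both helpers are assumptions for illustration, not part of any library:

# Hypothetical sketch: rejected_items maps a user id to the set of item
# ids the user clicked "do not show again" on.
def record_rejection(rejected_items, user_id, item_id):
    """Store the rejection signal; it can later feed model training."""
    rejected_items.setdefault(user_id, set()).add(item_id)

def filter_rejected(rejected_items, user_id, recommendations):
    """Drop anything the user has explicitly rejected before serving."""
    rejected = rejected_items.get(user_id, set())
    return [item for item in recommendations if item not in rejected]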

Cloud NLP and sentiment analysis

All three of the dominant cloud providers—Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure—have solid NLP engines that can be called via an API. In this section, NLP examples on all three clouds will be explored. Additionally, a real-world production AI pipeline for NLP will be created on AWS using serverless technology.

NLP on Microsoft Azure

Microsoft Cognitive Services has the Text Analytics API, which offers language detection, key phrase extraction, and sentiment analysis. In Figure 2, the endpoint is created so API calls can be made. This example takes a negative collection of movie reviews from the Cornell Computer Science Data Set on Movie Reviews and uses it to walk through the API.


Figure 2: The Microsoft Azure Cognitive Services API

First, imports are done in the initial block of the Jupyter Notebook:

In [1]: import requests
   ...: import os
   ...: import pandas as pd
   ...: import seaborn as sns
   ...: import matplotlib.pyplot as plt
   ...:

Next, an API key is taken from the environment. This API key was fetched from the console shown in Figure 2, under the Keys section, and exported as an environment variable so it isn’t hard-coded in the source. Additionally, the text API URL that will be used later is assigned to a variable.

In [4]: subscription_key = os.environ.get("AZURE_API_KEY")
In [5]: text_analytics_base_url = \
   ...: "https://eastus.api.cognitive.microsoft.com/text/analytics/v2.0/"

Next, one of the negative reviews is formatted in the way the API expects:

In [9]: documents = {"documents": []}
   ...: path = "../data/review_polarity/\
   ...: txt_sentoken/neg/cv000_29416.txt"
   ...: doc1 = open(path, "r")
   ...: output = doc1.readlines()
   ...: count = 0
   ...: for line in output:
   ...:     count += 1
   ...:     record = {"id": count, "language": "en", "text": line}
   ...:     documents["documents"].append(record)
   ...:
   ...: # print it out
   ...: documents

The data structure with the following shape is created:

Out[9]:
{'documents': [{'id': 1,
   'language': 'en',
   'text': 'plot : two teen couples go to a\
      church party , drink and then drive . \n'},
  {'id': 2, 'language': 'en',
   'text': 'they get into an accident . \n'},
  {'id': 3,
   'language': 'en',
   'text': 'one of the guys dies ,\
 but his girlfriend continues to see him in her life,\
 and has nightmares . \n'},
  {'id': 4, 'language': 'en', 'text': "what's the deal ? \n"},
  {'id': 5,
   'language': 'en',

Finally, the sentiment analysis API is used to score the individual documents.
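The call itself isn’t shown in this excerpt; a minimal sketch, assuming the documents payload, subscription_key, and text_analytics_base_url defined above, would POST to the v2.0 sentiment endpoint:

In [10]: sentiment_url = text_analytics_base_url + "sentiment"
    ...: headers = {"Ocp-Apim-Subscription-Key": subscription_key}
    ...: sentiments = requests.post(sentiment_url,
    ...:     headers=headers, json=documents).json()

The scored documents come back in the following shape: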

{'documents': [{'id': '1', 'score': 0.5},
   {'id': '2', 'score': 0.13049307465553284},
   {'id': '3', 'score': 0.09667149186134338},
   {'id': '4', 'score': 0.8442018032073975},
   {'id': '5', 'score': 0.808459997177124

At this point, the returned scores can be converted into a Pandas DataFrame to do some EDA. It isn’t a surprise that the median value of the sentiments for a negative review is 0.23 on a scale of 0 to 1, where 1 is extremely positive and 0 is extremely negative:

In [11]: df = pd.DataFrame(sentiments['documents'])

In [12]: df.describe()
Out[12]:
           score
count  35.000000
mean    0.439081
std     0.316936
min     0.037574
25%     0.159229
50%     0.233703
75%     0.803651
max     0.948562

This is further illustrated by doing a density plot. Figure 3 shows a majority of highly negative sentiments.
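The plot itself takes only a couple of lines; a minimal sketch, assuming the df created above (seaborn and matplotlib.pyplot were imported at the start of the notebook):

In [13]: # Density plot of the per-line sentiment scores (Figure 3)
    ...: sns.kdeplot(df['score'], shade=True)
    ...: plt.xlabel("sentiment score")
    ...: plt.show()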


Figure 3: A density plot of sentiment scores

NLP on Google Cloud Platform

There is a lot to like about the Google Cloud Natural Language API. One of the convenient features of the API is that you can use it in two different ways: analyzing sentiment in a string, and also analyzing sentiment from Google Cloud Storage. Google Cloud also has a tremendously powerful command-line tool that makes it easy to explore its API. Finally, it has some fascinating AI APIs, some of which will be explored in this chapter: analyzing sentiment, analyzing entities, analyzing syntax, analyzing entity sentiment, and classifying content.
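As a quick illustration of those two modes, here is a minimal sketch using the same google.cloud.language client that appears later in this section; the gs:// bucket path is a hypothetical placeholder:

# Sentiment of an in-memory string versus a file in Google Cloud Storage
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types

client = language.LanguageServiceClient()

# 1. Analyze sentiment of a plain string
document = types.Document(
    content="LeBron James plays for the Cleveland Cavaliers.",
    type=enums.Document.Type.PLAIN_TEXT)
sentiment = client.analyze_sentiment(document).document_sentiment
print(sentiment.score, sentiment.magnitude)

# 2. Analyze sentiment of a document stored in Google Cloud Storage
gcs_document = types.Document(
    gcs_content_uri="gs://my-bucket/review.txt",  # hypothetical path
    type=enums.Document.Type.PLAIN_TEXT)
print(client.analyze_sentiment(gcs_document).document_sentiment.score)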

Exploring the Entity API

Using the command-line gcloud API is a great way to explore what one of the APIs does. In the example, a phrase about LeBron James and the Cleveland Cavaliers is sent via the command line:

→ gcloud ml language analyze-entities --content=\
"LeBron James plays for the Cleveland Cavaliers."
{
  "entities": [
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 0,
            "content": "LeBron James"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {
        "mid": "/m/01jz6d",
        "wikipedia_url": "https://en.wikipedia.org/wiki/LeBron_James"
      },
      "name": "LeBron James",
      "salience": 0.8991045,
      "type": "PERSON"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 27,
            "content": "Cleveland Cavaliers"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {
        "mid": "/m/0jm7n",
        "wikipedia_url": "https://en.wikipedia.org/wiki/Cleveland_Cavaliers"
      },
      "name": "Cleveland Cavaliers",
      "salience": 0.100895494,
      "type": "ORGANIZATION"
    }
  ],
  "language": "en"
}

A second way to explore the API is to use Python. To get an API key and authenticate, you need to follow Google’s instructions. Then launch the Jupyter Notebook in the same shell where the GOOGLE_APPLICATION_CREDENTIALS variable is exported:

→ ✗  export GOOGLE_APPLICATION_CREDENTIALS=\
      /Users/noahgift/cloudai-65b4e3299be1.json
→ ✗  jupyter notebook

Once this authentication process is complete, the rest is straightforward. First, the Python language API must be imported. (This can be installed via pip if it isn’t already: pip install --upgrade google-cloud-language.)

In [1]: # Imports the Google Cloud client library
   ...: from google.cloud import language
   ...: from google.cloud.language import enums
   ...: from google.cloud.language import types

Next, a phrase is sent to the API and entity metadata is returned with an analysis:

In [2]:
   ...: text = "LeBron James plays for the Cleveland Cavaliers."
   ...: client = language.LanguageServiceClient()
   ...: document = types.Document(
   ...:       content=text,
   ...:       type=enums.Document.Type.PLAIN_TEXT)
   ...: entities = client.analyze_entities(document).entities

The output has a similar look and feel to the command-line version, but it comes back as a Python list:

[name: "LeBron James" type: PERSON
metadata {
   key: "mid"
   value: "/m/01jz6d"
}
metadata {
   key: "wikipedia_url"
   value: "https://en.wikipedia.org/wiki/LeBron_James"
}
salience: 0.8991044759750366
mentions {
   text {
      content: "LeBron James"
      begin_offset: -1
   }
   type: PROPER
}
, name: "Cleveland Cavaliers"
type: ORGANIZATION
metadata {
   key: "mid"
   value: "/m/0jm7n"
}
metadata {
   key: "wikipedia_url"
   value: "https://en.wikipedia.org/wiki/Cleveland_Cavaliers"
}
salience: 0.10089549422264099
mentions {
   text {
      content: "Cleveland Cavaliers"
      begin_offset: -1
   }
   type: PROPER
}
]

A few of the takeaways are that this API could be easily merged with some of the other explorations done in Chapter 6, “Predicting Social-Media Influence in the NBA.” It wouldn’t be hard to imagine creating an AI application that found extensive information about social influencers by using these NLP APIs as a starting point. Another takeaway is that the command line given to you by the GCP Cognitive APIs is quite powerful.

Production serverless AI pipeline for NLP on AWS

One thing AWS does well, perhaps better than any of the Big Three clouds, is make it easy to create production applications that are easy to write and manage. One of its game-changer innovations is AWS Lambda, which is available both to orchestrate pipelines and to serve HTTP endpoints, as in the case of Chalice. In Figure 4, a real-world production pipeline for NLP is described.


Figure 4: A production serverless NLP pipeline on AWS

To get started with AWS sentiment analysis, some libraries need to be imported:

In [1]: import pandas as pd
   ...: import boto3
   ...: import json

Next, a simple test is created:

In [5]: comprehend = boto3.client(service_name='comprehend')
   ...: text = "It is raining today in Seattle"
   ...: print('Calling DetectSentiment')
   ...: print(json.dumps(comprehend.detect_sentiment(\
   ...:     Text=text, LanguageCode='en'), sort_keys=True, indent=4))
   ...:
   ...: print('End of DetectSentiment\n')
   ...:

The output shows a SentimentScore:

Calling DetectSentiment
{
   "ResponseMetadata": {
      "HTTPHeaders": {
         "connection": "keep-alive",
         "content-length": "164",
         "content-type": "application/x-amz-json-1.1",
         "date": "Mon, 05 Mar 2018 05:38:53 GMT",
         "x-amzn-requestid": "7d532149-2037-11e8-b422-3534e4f7cfa2"
      },
      "HTTPStatusCode": 200,
      "RequestId": "7d532149-2037-11e8-b422-3534e4f7cfa2",
      "RetryAttempts": 0
   },
   "Sentiment": "NEUTRAL",
   "SentimentScore": {
         "Mixed": 0.002063251566141844,
         "Negative": 0.013271247036755085,
         "Neutral": 0.9274052977561951,
         "Positive": 0.057260122150182724
   }
}
End of DetectSentiment

Now, in a more realistic example, I’ll use the previous “negative movie reviews document” from the Azure example. The document is read in:

In [6]: path = "/Users/noahgift/Desktop/review_polarity/\
txt_sentoken/neg/cv000_29416.txt"
   ...: doc1 = open(path, "r")
   ...: output = doc1.readlines()
   ...:

Next, one of the “documents” (remember that each line is a document according to NLP APIs) is scored:

In [7]: print(json.dumps(comprehend.detect_sentiment(\
   ...:     Text=output[2], LanguageCode='en'), sort_keys=True, indent=4))

{
   "ResponseMetadata": {
      "HTTPHeaders": {
         "connection": "keep-alive",
         "content-length": "158",
         "content-type": "application/x-amz-json-1.1",
         "date": "Mon, 05 Mar 2018 05:43:25 GMT",
         "x-amzn-requestid": "1fa0f6e8-2038-11e8-ae6f-9f137b5a61cb"
      },
      "HTTPStatusCode": 200,
      "RequestId": "1fa0f6e8-2038-11e8-ae6f-9f137b5a61cb",
      "RetryAttempts": 0
   },
   "Sentiment": "NEUTRAL",
   "SentimentScore": {
      "Mixed": 0.1490383893251419,
      "Negative": 0.3341641128063202,
      "Neutral": 0.468740850687027,
      "Positive": 0.04805663228034973
   }
}

It’s no surprise that the document had a negative sentiment score, since it was previously scored this way. Another interesting thing this API can do is score all of the documents inside as one giant score. Basically, it gives the median sentiment value. Here is what that looks like:

In [8]: whole_doc = ', '.join(map(str, output))

In [9]: print(json.dumps(comprehend.detect_sentiment(\
   ...:     Text=whole_doc, LanguageCode='en'), sort_keys=True, indent=4))
{
   "ResponseMetadata": {
      "HTTPHeaders": {
         "connection": "keep-alive",
         "content-length": "158",
         "content-type": "application/x-amz-json-1.1",
         "date": "Mon, 05 Mar 2018 05:46:12 GMT",
         "x-amzn-requestid": "8296fa1a-2038-11e8-a5b9-b5b3e257e796"
      }
   },
   "Sentiment": "MIXED",
   "SentimentScore": {
      "Mixed": 0.48351600766181946,
      "Negative": 0.2868672013282776,
      "Neutral": 0.12633098661899567,
      "Positive": 0.1032857820391655
   }
}

An interesting takeaway is that the AWS API has some hidden tricks up its sleeve and captures a nuance that is missing from the Azure API. In the previous Azure example, the Seaborn output showed that, indeed, there was a bimodal distribution, with a minority of reviews liking the movie and a majority disliking it. The way AWS presents the results as “mixed” sums this up quite nicely.

The only thing left to do is create a simple Chalice app that takes the scored inputs written to DynamoDB and serves them out. Here is what that looks like:

from uuid import uuid4
import logging
import time

from chalice import Chalice
import boto3
from boto3.dynamodb.conditions import Key
from pythonjsonlogger import jsonlogger

#APP ENVIRONMENTAL VARIABLES
REGION = "us-east-1"
APP = "nlp-api"
NLP_TABLE = "nlp-table"

#initialize logging
log = logging.getLogger("nlp-api")
LOGHANDLER = logging.StreamHandler()
FORMATTER = jsonlogger.JsonFormatter()
LOGHANDLER.setFormatter(FORMATTER)
log.addHandler(LOGHANDLER)
log.setLevel(logging.INFO)

app = Chalice(app_name='nlp-api')
app.debug = True

def dynamodb_client():
   """Create Dynamodb Client"""

   extra_msg = {"region_name": REGION, "aws_service": "dynamodb"}
   client = boto3.client('dynamodb', region_name=REGION)
   log.info("dynamodb CLIENT connection initiated", \
      extra=extra_msg)
   return client

def dynamodb_resource():
   """Create Dynamodb Resource"""

   extra_msg = {"region_name": REGION, "aws_service": "dynamodb"}
   resource = boto3.resource('dynamodb', region_name=REGION)
   log.info("dynamodb RESOURCE connection initiated",\
      extra=extra_msg)
   return resource

def create_nlp_record(score):
   """Creates nlp Table Record"""

   db = dynamodb_resource()
   pd_table = db.Table(NLP_TABLE)
   guid = str(uuid4())
   res = pd_table.put_item(
      Item={
         'guid': guid,
         'UpdateTime' : time.asctime(),
         'nlp-score': score
      }
   )
   extra_msg = {"region_name": REGION, "aws_service": "dynamodb"}
   log.info(f"Created NLP Record with result{res}", extra=extra_msg)
   return guid

def query_nlp_record():
   """Scans nlp table and retrieves all records"""

   db = dynamodb_resource()
   extra_msg = {"region_name": REGION, "aws_service": "dynamodb",
      "nlp_table":NLP_TABLE}
   log.info(f"Table Scan of NLP table", extra=extra_msg)
   pd_table = db.Table(NLP_TABLE)
   res = pd_table.scan()
   records = res['Items']
   return records

@app.route('/')
def index():
   """Default Route"""

   return {'hello': 'world'}

@app.route("/nlp/list")
def nlp_list():
   """list nlp scores"""

   extra_msg = {"region_name": REGION,
      "aws_service": "dynamodb",
      "route":"/nlp/list"}
   log.info(f"List NLP Records via route", extra=extra_msg)
   res = query_nlp_record()
   return res
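
Assuming the DynamoDB table exists and has records, a quick local smoke test with the Chalice CLI might look like this (the exact output depends on what has been written to the table):

→ chalice local
Serving on http://127.0.0.1:8000
→ curl http://localhost:8000/nlp/list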

Summary

If data is the new oil, then UGC is the tar sands. Tar sands have been historically difficult to turn into production oil pipelines, but rising energy costs and advances in technology have made mining them feasible. Similarly, the AI APIs coming out of the Big Three cloud providers have created new technological breakthroughs in sifting through the “sandy data.” Also, prices for storage and computation have steadily dropped, making it much more feasible to convert UGC into an asset from which to extract extra value. Another innovation lowering the cost to process UGC is AI accelerators. Massive parallelization improvements by specialized chips like TPUs, GPUs, and field-programmable gate arrays (FPGAs) may make some of the scale issues discussed even less of an issue.

This chapter showed many examples of how to extract value from these tar sands, but there are also real trade-offs and dangers, as with the real thing. UGC-to-AI feedback loops can be tricked and exploited in ways that create consequences that are global in scale. Also, on a much more practical level, there are trade-offs to consider when systems go live. As easy as the cloud and AI APIs make creating solutions, the real trade-offs cannot be abstracted away: UX, performance, and the business implications of implemented solutions.

Copyright © 2018 IDG Communications, Inc.