AllCorrect DA Project
import numpy as np
import pandas as pd

df = pd.read_csv('/content/drive/My Drive/data/df.csv', index_col=0)
df.columns = ['id', 'score', 'text']
df.shape
df.head()
df = df.dropna()
df = df.drop_duplicates().reset_index(drop=True)
keywords = pd.read_excel('/content/drive/My Drive/data/Keywords.xlsx')
keywords
keywords = keywords.drop(29, axis=0)
keywords['Language combination'] = keywords['Language combination'].str.replace('English - ', '')
keywords_list = (
    keywords['"Translation"'].tolist()
    + keywords['"Language"'].tolist()
    + keywords['"Localization"'].tolist()
    + keywords['"English"'].tolist()
    + keywords['Language'].tolist()
    + keywords['Language combination'].tolist()
)
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

stemmer = PorterStemmer()
stemmer_rus = SnowballStemmer('russian')
keywords_stem = [stemmer.stem(word) for word in keywords_list]
keywords_stem = keywords_stem + ['翻译', '语言', '中文', '中国', '汉语', '本土化', '英语', '版本',
'区域', '跨地区', '地区', '华语', '汉化', '英文', '国家', '国服', 'VPN', 'русс', 'англ']
keywords_stem = [word.replace(' ', '') for word in keywords_stem]
keywords_string = '|'.join(keywords_stem)
df['is_about_localization'] = df['text'].str.lower().str.contains(keywords_string.lower())
df.head()
df_loc = df[df['is_about_localization']].copy()
df_loc.shape
df_loc
import re

def clear_text(text):
    # keep only latin letters and apostrophes, collapse whitespace
    new_text = re.sub(r"[^a-zA-Z']", " ", text)
    new_text = " ".join(new_text.split())
    return new_text.lower()
import langdetect

def determine_language(text):
    try:
        return langdetect.detect(text)
    except langdetect.LangDetectException:
        return None
df_loc['language'] = df_loc['text'].apply(determine_language)
df_loc['language'].unique()
df_loc[df_loc['language'].isnull()]
df_loc['language'] = df_loc['language'].fillna('en')
df_loc.groupby('language')['language'].count().sort_values()
df_loc['language'] = df_loc['language'].str.replace('zh-tw', 'zh')
df_loc['language'] = df_loc['language'].str.replace('zh-cn', 'zh')
table_lang_and_codes = keywords[['Language combination', 'ISO 639-1']].copy()
table_lang_and_codes['ISO 639-1'] = table_lang_and_codes['ISO 639-1'].str.lower()
table_lang_and_codes.columns = ['language_full', 'language']
# keep only the first row per language code (duplicates become NaN and are dropped below)
table_lang_and_codes['language'] = table_lang_and_codes['language'].drop_duplicates()
table_lang_and_codes = table_lang_and_codes.dropna()
table_lang_and_codes['language_full'] = table_lang_and_codes['language_full'].str.replace('Chinese Simplified', 'Chinese')
table_lang_and_codes['language_full'] = table_lang_and_codes['language_full'].where(table_lang_and_codes['language_full'] != 'Portuguese (BRZ)', 'Portuguese')
table_lang_and_codes
df_loc_new = df_loc.merge(table_lang_and_codes, on='language', how='left')
df_loc_new.head()
df_loc_new[df_loc_new['language_full'].isnull()].groupby('language')['language'].count()
There are some reviews in languages that we don't have in the keyword lists, but they contain words shared with the languages we do have. We can translate them to English to process them further.
df_loc_known = df_loc_new.dropna()
df_loc_known.shape
# The translation was run once and the result cached to CSV (loaded in the
# next cell); GoogleTranslator here is presumably deep_translator's:
# from deep_translator import GoogleTranslator
# translator = GoogleTranslator(source='auto', target='en')
# def translate_to_en(text):
#     try:
#         return translator.translate(text)
#     except Exception:
#         return None
# df_loc_known['en_trans'] = df_loc_known['text'].apply(translate_to_en)
df_loc_known = pd.read_csv('/content/drive/My Drive/data/df_loc_known.csv', index_col=0)
df_loc_known[df_loc_known['en_trans'].isnull()]['language'].count()
Only 343 reviews remain without a translation. They can be used to test our future model.
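As a minimal sketch, these untranslated reviews could be set aside as a hold-out set (df_loc_test is a hypothetical name and is not used below):
# Hypothetical hold-out set of the untranslated reviews
# (df_loc_test is an illustrative name, not reused later).
df_loc_test = df_loc_known[df_loc_known['en_trans'].isnull()].copy()
df_loc_test.shape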
df_final = df_loc_known[df_loc_known['en_trans'].notnull()].copy()
import nltk

nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
df_final['sentiment_vader'] = df_final['en_trans'].apply(lambda text: sid.polarity_scores(text))
for i in range(20, 30):
    print('Row ', i)
    print(df_final.loc[i, 'en_trans'])
    print(df_final.loc[i, 'sentiment_vader']['compound'])
    print('')
# reuse the scores computed above instead of running VADER a second time
df_final['compound'] = df_final['sentiment_vader'].apply(lambda scores: scores['compound'])
df_final['score_sentiment'] = pd.cut(df_final['compound'], 5, labels=[1, 2, 3, 4, 5]).astype(int)
The compound value is a single normalized score between -1 and 1 that balances the negative and positive components of the sentiment. That is why it is used for the further analysis.
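As a quick illustration (the sample sentences are made up), polarity_scores returns the component scores alongside the compound value:
# Made-up sample sentences: polarity_scores returns a dict with 'neg',
# 'neu', 'pos' (proportions) and 'compound' (normalized to [-1, 1]).
for sample in ["The localization is great, thank you!",
               "No translation, the menus are broken, awful."]:
    print(sample, '->', sid.polarity_scores(sample))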
df_final['is_pos'] = (df_final['compound'] >= 0).astype(int)
# games that received at least one review with score 0
games_with_zero = df_final[df_final['score'] < 1]['id'].unique().tolist()
for i in games_with_zero:
    print(i, 'Scores:', df_final[df_final['id'] == i]['score'].unique(),
          'Number of reviews:', df_final[df_final['id'] == i]['id'].count())
    print('')
Every game that has a score of 0 also has a score of 1. There was probably some kind of "do you like / dislike our game?" question in those games, which is why there are 22 games with only those two scores.
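As a quick check, this observation can be verified directly:
# Sanity check: every game that received a 0 should also have
# at least one review with score 1.
all(1 in df_final[df_final['id'] == g]['score'].values for g in games_with_zero)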
avrg_score = pd.DataFrame(df_final.groupby('id')['score'].mean())
avrg_score_sentiment = pd.DataFrame(df_final.groupby('id')['score_sentiment'].mean())
avrg_compound_sentiment = pd.DataFrame(df_final.groupby('id')['is_pos'].mean())
scores_all = avrg_score[
    (~avrg_score.index.isin(games_with_zero))
    & (df_final.groupby('id')['score'].count() > 1)
].merge(avrg_score_sentiment, on='id')
scores_all = scores_all.round(3)
scores_all.sort_values(by='score_sentiment').head()
compound_all = avrg_score[avrg_score.index.isin(games_with_zero)].merge(avrg_compound_sentiment, on='id')
compound_all = compound_all.round(3)
compound_all.sort_values(by='is_pos').head()
scores_all[:50].plot.bar(figsize=(30, 10))
review_count = df_final[~df_final['id'].isin(games_with_zero)].groupby('id')['text'].count().reset_index()
review_count.columns = ['id', 'text_count']
review_count.head()
scores_all = scores_all.merge(review_count, on='id').sort_values(by=['score_sentiment'], ascending=True)
scores_all[:10]
df_final.head()
neg_review_count = df_final[df_final['is_pos']==0].groupby('id')['text'].count().reset_index().sort_values(by='text', ascending=False)
neg_review_count.head(10)
These are the 10 games with the highest count of negative localization reviews. Let's see which languages were requested most often.
Usually the primary language of a new game is English, so I assume that people who wrote their reviews in English were not asking for a translation. That is of course not strictly true, but it gives a fast overview.
df_to_dashboard_negative_top = df_final[
    (df_final['id'].isin(neg_review_count['id'].tolist()[:10]))
    & (df_final['is_pos'] == 0)
    & (df_final['language_full'] != 'English')
][['id', 'language_full', 'score_sentiment', 'is_pos']]
for i in neg_review_count['id'].tolist()[:10]:
    # per-game breakdown of negative, non-English reviews by language
    df_game = (df_final[(df_final['id'] == i) & (df_final['is_pos'] == 0)
                        & (df_final['language_full'] != 'English')]
               .groupby('language_full')['text'].count()
               .reset_index().sort_values(by='text', ascending=False))
    print("game:", i)
    print(df_game.head(10))
    print('')
    top_lang = df_game['language_full'][:4]
    df_to_dashboard_negative_top['language_full'] = np.where(
        ~(df_to_dashboard_negative_top['language_full'].isin(top_lang))
        & (df_to_dashboard_negative_top['id'] == i),
        'Other',
        df_to_dashboard_negative_top['language_full'])
popular_games_without_loc = scores_all.sort_values(by=['text_count'], ascending=False)['id'][:10].tolist()
popular_games_without_loc
df_final[df_final['id'].isin(popular_games_without_loc)].groupby('language_full')['language_full'].count().sort_values(ascending=False)
df_final.groupby('language_full')['language_full'].count().sort_values(ascending=False)
Conclusion
We do not really know the reasons for the negative reviews. VADER relies on the emotional weight of words, so a review is marked negative simply because the reviewer used emotionally negative words.
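For illustration (a made-up sentence): a complaint phrased without emotionally charged words typically scores close to neutral, even though it clearly reports a localization problem:
# Made-up example: likely a near-neutral compound despite reporting a problem.
print(sid.polarity_scores("There is no Turkish translation in this game."))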
We do know the sentiment of all reviews, so we can assume that a negative review points to a problem with the language the review was written in.
The languages in which most localization reviews are written are English, Turkish, Portuguese, Thai, Russian, Chinese, Korean, French, German, Italian, and Indonesian.