User:Dr. Greywolf/Bot
Original pseudocode for the bot: Dr. Greywolf (talk) 27 June 2022 (Mon) 15:00 (UTC)
- Take the source code of an English Wikipedia navbox template;
- Store every piece of text enclosed in [[ ]] in an array;
- For each entry in the array:
  - fetch the corresponding English Wikipedia article;
  - go to that article's Chinese version and take the Hong Kong-variant title;
  - look up the Jyutping pronunciation of the Chinese-character title on 粵典;
  - take the first 5 sentences of the English article;
  - machine-translate those sentences with Microsoft's translator; ([1])
  - use the above information to create a stub.
28-June-2022
@Z423x5c6, Deryck Chan, H78c67c, Shinjiman: I've just written a finite-state machine:
- Input: the wiki code of a navbox template;
- Output: an array storing the articles' titles;
I just tested it on en:Template:Natural language processing, and the code gives the expected output:
Dr. Greywolf (talk) 27 June 2022 (Mon) 14:53 (UTC)
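For comparison, the same [[...]] extraction the state machine performs can be sketched with a regex. This is a hypothetical alternative, not the bot's actual code (the helper name is mine), and it includes the File:/Category: filtering and deduplication the bot applies:

```python
import re

def extract_titles(wikicode):
    """Collect link targets from [[target]] and [[target|label]] markup,
    skipping file and category links and keeping first-seen order."""
    titles = []
    for target in re.findall(r'\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]', wikicode):
        target = target.strip()
        if target and not target.startswith(('File:', 'Category:')):
            if target not in titles:  # drop duplicates, like dict.fromkeys below
                titles.append(target)
    return titles
```

The regex captures the part before any `|`, so piped labels are discarded just as the character-by-character reader does.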
30-June-2022
@Z423x5c6, Deryck Chan, H78c67c, Shinjiman: I've just added the following code:
- Input: an array storing the articles' English titles;
- Output: an array storing the articles' expected Cantonese titles;
...
The code above takes an array listing the titles of the English Wikipedia articles to look up, and returns a new array listing those articles' expected Cantonese Wikipedia titles:
- I used the wiki code of the English Wikipedia music theory navbox as input;
- The output, as expected, was
['樂理', '音樂美學', '音樂分析', '音樂元素', '作曲', '音樂嘅定義', 'Music and mathematics', '音樂學', '音樂哲學', '音樂心理學', '音樂集合論', '調音', 'list of music theorists']
;
Dr. Greywolf (talk) 29 June 2022 (Wed) 16:16 (UTC)
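The title lookup described above boils down to a label-preference rule: take the Wikidata item's yue label if it exists, else zh-hant, else en, else 'N/A'. A minimal sketch of that rule as a pure function; the pywikibot plumbing around it is shown in a comment, and the function name is mine, not the bot's:

```python
LABEL_PREFERENCE = ('yue', 'zh-hant', 'en')

def pick_label(labels, fallback='N/A'):
    """Pick the best available Wikidata label: Cantonese first, then
    Traditional Chinese, then English; fall back to 'N/A' if none exist."""
    for lang in LABEL_PREFERENCE:
        if lang in labels:
            return labels[lang]
    return fallback

# With pywikibot, the labels dict would come from something like:
#   item = pywikibot.ItemPage.fromPage(pywikibot.Page(site, eng_title))
#   title = pick_label(item.get()['labels'])
```

Isolating the rule this way also makes it easy to change the preference order later (e.g. inserting zh-hk) without touching the network code.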
- Added some try blocks so that red links in the template won't crash the program. Dr. Greywolf (talk) 30 June 2022 (Thu) 07:17 (UTC)
Next steps:
- Teach the program to grab the first 5 sentences of each English Wikipedia article; Dr. Greywolf (talk) 30 June 2022 (Thu) 07:19 (UTC)
- The "teach the program to find Jyutping" part is not essential, so it is postponed; the rest of the program comes first. Dr. Greywolf (talk) 30 June 2022 (Thu) 07:19 (UTC)
2-July-2022
@Z423x5c6, Deryck Chan, H78c67c, Shinjiman: Update:
- Input: an array storing the articles' English titles;
- Output: for each article, the opening part of its content;
...
Tested it on the English Wikipedia music theory navbox; mostly working correctly. Dr. Greywolf (talk) 1 July 2022 (Fri) 15:48 (UTC)
Next step: machine translation. ~~ Dr. Greywolf (talk) 1 July 2022 (Fri) 15:46 (UTC)
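For the planned machine-translation step, the original pseudocode mentions Microsoft's translator. A hedged sketch against the Azure Translator v3 REST endpoint might look like the following; the subscription key, region, and the availability of yue as a target language are assumptions, and this is not the bot's actual code:

```python
import requests

ENDPOINT = 'https://api.cognitive.microsofttranslator.com/translate'

def translate_request(texts, to_lang='yue'):
    """Build the URL params and JSON body for an Azure Translator v3 call."""
    params = {'api-version': '3.0', 'to': to_lang}
    body = [{'Text': t} for t in texts]
    return params, body

def translate(texts, key, region, to_lang='yue'):
    """Translate a list of strings (network call; needs an Azure key)."""
    params, body = translate_request(texts, to_lang)
    headers = {
        'Ocp-Apim-Subscription-Key': key,       # assumed credentials
        'Ocp-Apim-Subscription-Region': region,
        'Content-Type': 'application/json',
    }
    resp = requests.post(ENDPOINT, params=params, headers=headers, json=body)
    resp.raise_for_status()
    return [item['translations'][0]['text'] for item in resp.json()]
```

Keeping the request-building separate from the network call makes the translation step easy to dry-run before spending API quota.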
@Z423x5c6, Deryck Chan, H78c67c, Shinjiman: Would people on other wikis be interested in this tool? Dr. Greywolf (talk) 1 July 2022 (Fri) 16:04 (UTC)
I'm wondering whether the part on line 14 above (if '| ' in storaga or 'File:' in storaga ...) should be changed to use an artificial neural network. Dr. Greywolf (talk) 2 July 2022 (Sat) 01:52 (UTC)
- Actually you can use mw:Extension:TextExtracts#API (e.g. https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exchars=500&explaintext=1&titles=Artificial+intelligence ) to get article extracts. H78c67c·talk 2 July 2022 (Sat) 02:07 (UTC)
- @H78c67c: OK, thank you. Silly me. XD Dr. Greywolf (talk) 2 July 2022 (Sat) 02:15 (UTC)
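A minimal sketch of H78c67c's suggestion, calling the TextExtracts API with the same parameters as the example URL, using the requests library (the helper names are mine):

```python
import requests

API = 'https://en.wikipedia.org/w/api.php'

def extract_params(title, chars=500):
    """Build TextExtracts query parameters for a plain-text lead extract."""
    return {
        'action': 'query',
        'prop': 'extracts',
        'exchars': chars,       # cap the extract length, as in the example URL
        'explaintext': 1,       # plain text rather than HTML
        'format': 'json',
        'titles': title,
    }

def fetch_extract(title, chars=500):
    """Fetch the lead extract of an English Wikipedia article (network call)."""
    data = requests.get(API, params=extract_params(title, chars)).json()
    pages = data['query']['pages']          # dict keyed by page id
    return next(iter(pages.values())).get('extract', '')
```

This would replace the hand-rolled paragraph filtering in the scraping loop, since the API already strips templates, files, and markup.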
3-July-2022
@Z423x5c6, Deryck Chan, H78c67c, Shinjiman: Update:
- Input: an array storing the articles' opening passages;
- Output: each entry in the array with the necessary extras ("See also" section, etc.) appended;
...
Next step: teach the program to create the articles and paste the content in. Dr. Greywolf (talk) 2 July 2022 (Sat) 15:52 (UTC)
10-July-2022
Full code goes here:
15-July-2022 update. Dr. Greywolf (talk) 15 July 2022 (Fri) 13:30 (UTC)
Updated on 22-July-2022. Dr. Greywolf (talk) 22 July 2022 (Fri) 07:34 (UTC)
Final update on 26-July-2022. Dr. Greywolf (talk) 26 July 2022 (Tue) 02:32 (UTC)
"""
Created on Sat Jun 25 00:31:59 2022
@author: No one
"""
import pywikibot
import re
from docx import Document
import time
##### Declare the Article class
class Article:
    eng_title = ''
    canto_title = ''
    content = ''
    DoesCantoWikiHaveIt = False  # True if the Cantonese wiki already has this article
    TitleNeedsManualTranslated = False  # True if no Chinese-character title is available
    def __init__(self, eng_tit):
        self.eng_title = eng_tit

state = 0
# state is the number of '[' read so far;
# it is used only while reading the input file
filename = "python_testing.txt"  # the txt file that holds the input wiki code
topic_list = []
store = ""
with open(filename) as text_file:
    navbox = text_file.read()  # the navbox wiki code to be converted

##### ##### Read the input and create the array...
with open(filename) as f:
    while True:
        c = f.read(1)
        if not c:
            break
        if c == '[' and state == 0:
            state = 1
        elif c == '[' and state == 1:
            state = 2  # state == 2 means we are inside a [[ ]] link: start reading
        if c != '[' and c != ']' and state == 2:
            if c != '|':
                store += c
        if c == ']' or c == '|':
            state = 0
            if store != "" and "File:" not in store and "Category:" not in store:
                topic_list.append(store)
            store = ""

##### Remove duplicates, just in case...
topic_list = list(dict.fromkeys(topic_list))
##### Create a list of article objects...
article_list = []
print('Here are the article titles read from the file:')
for x in topic_list:
    article_list.append(Article(x))
for x in article_list:
    print(x.eng_title)

##### ##### Create an array to store the articles' Cantonese titles
canto_title_list = []
for x in topic_list:
    try:
        site = pywikibot.Site()
        page = pywikibot.Page(site, x)
        item = pywikibot.ItemPage.fromPage(page)
        # print(item)  # shows the item's Wikidata Q address
        item_dict = item.get()
        try:
            canto_title = item_dict["labels"]['yue']
            for y in article_list:
                if y.eng_title == x:
                    y.DoesCantoWikiHaveIt = True
        except KeyError:
            try:
                canto_title = item_dict["labels"]['zh-hant']
            except KeyError:
                try:
                    canto_title = item_dict["labels"]['en']
                    for y in article_list:
                        if y.eng_title == x:
                            y.TitleNeedsManualTranslated = True
                    ## If an article has no Chinese-character title to use,
                    ## it needs manual translation...
                except KeyError:
                    canto_title = 'N/A'
    except:
        canto_title = 'N/A'
    canto_title_list.append(canto_title)
    for z in article_list:
        if z.eng_title == x:
            z.canto_title = canto_title  # store the title we found

print('\nJust attempted finding a Cantonese title for each article...')
print('The following articles are not present in the Cantonese wiki...')
print('Though they may be present in the Chinese wiki...')
for x in article_list:
    if x.DoesCantoWikiHaveIt == False:
        print(x.eng_title + ', ' + x.canto_title)
print('\nSome articles need a kanji title...')
for x in article_list:
    if x.TitleNeedsManualTranslated == True:
        print(x.eng_title)
##### Let's alter the navbox accordingly...
for x in article_list:
    xplus = '[[' + x.eng_title + '|'
    xplusb = '[[' + x.eng_title + ']'
    cantoplus = '[[' + x.canto_title + '|'
    cantoplusb = '[[' + x.canto_title + ']'
    if (x.DoesCantoWikiHaveIt == False) and (x.TitleNeedsManualTranslated == False):
        cantoplus = '<!--' + x.eng_title + '-->' + '[[' + x.canto_title + '|'
        cantoplusb = '<!--' + x.eng_title + '-->' + '[[' + x.canto_title + ']'
    try:
        navbox = navbox.replace(xplus, cantoplus)
        navbox = navbox.replace(xplusb, cantoplusb)
    except:
        print('error on ' + x.eng_title + ', aka ' + x.canto_title)
##### End of fixing the navbox...
##### ##### For each article listed in topic_list, get its first xx words.
for x in topic_list:
    try:
        site = pywikibot.Site()
        page = pywikibot.Page(site, x)
        en_page_text = page.get()
        storaga = ""  # storaga is "storage" plus "firaga".
        for y in en_page_text:
            storaga += y
            if y == '\n' and storaga != '\n':
                if '| ' in storaga or 'File:' in storaga or storaga[0] == '[' or storaga[0] == '|' or storaga[0] == '{':
                    storaga = ''
                elif len(storaga) > 100:  # keep the stored text only if it is long enough
                    storaga = re.sub(r'\n', '', storaga)
                    for z in article_list:  # find the article for this topic and copy the content
                        if z.eng_title == x:
                            z.content = storaga
                    storaga = ''
                    break
    except:
        print('error - no article found, or something, for ' + x)
## for x in article_list:  # this loop is again for debugging purposes
##     print(x.content)
##### ##### For each piece of stored content, add the essential stuffs...
for x in article_list:
    texti_temp = '{{translation|英維|tpercent=0}} \n' + x.content + '\n' + '<!--({{jpingauto|}};{{lang-en|}})。\n== 註釋 ==\n{{reflist|group=註}}-->'
    x.content = texti_temp
    x.content += '\n==睇埋==\n*[[' + canto_title_list[0] + ']]\n*[[]]\n*[[]]\n==攷==\n{{reflist}}\n{{stub}}\n[[Category:' + canto_title_list[0] + ']]'

##### ##### Let's write everything to the output file...
outputfile = Document()
## Put everything into the docx file..
try:
    p = outputfile.add_paragraph(navbox)
except:
    p = outputfile.add_paragraph(' ')
p = outputfile.add_paragraph('\n\n\n')
for x in article_list:
    if x.DoesCantoWikiHaveIt == False:
        try:
            p = outputfile.add_paragraph(x.content)
        except:
            p = outputfile.add_paragraph(' ')
        p = outputfile.add_paragraph('\n\n\n')
outputfile.save('wiki_autonavbox_output.docx')
print('\nThe program has finished running. Exiting in a few seconds...')
time.sleep(10)
@Z423x5c6, Deryck Chan, H78c67c, Shinjiman: I'm stuck on the "teach the program to actually edit the wiki" part. Any suggestions?
@Z423x5c6, Deryck Chan, H78c67c, Shinjiman: Just updated. Since I still can't get the program to do the editing itself, for now it is set to save everything that would go onto the wiki into a docx file instead. Dr. Greywolf (talk) 15 July 2022 (Fri) 13:30 (UTC)
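For the step the program is stuck on, the usual pywikibot pattern is to set page.text and call page.save(). A minimal sketch, assuming a bot account configured in user-config.py; the helper names and edit summary are mine, and build_stub just mirrors the wikitext shape the script currently writes to the docx file:

```python
def build_stub(content, see_also):
    """Assemble stub wikitext in the same shape the script writes to docx."""
    return ('{{translation|英維|tpercent=0}}\n' + content
            + '\n==睇埋==\n*[[' + see_also + ']]\n'
            + '==攷==\n{{reflist}}\n{{stub}}')

def save_stub(title, wikitext, summary='Create stub (bot-assisted)'):
    """Create or overwrite a page on yue.wikipedia.org (network call)."""
    import pywikibot  # lazy import: needs a configured user-config.py
    site = pywikibot.Site('yue', 'wikipedia')
    page = pywikibot.Page(site, title)
    page.text = wikitext
    page.save(summary=summary)  # performs the actual edit
```

Pointing save_stub at a Draft: title instead of the main namespace would give the human-review pause suggested below.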
- It would be good anyway to pause between generating the machine translation and pasting it onto the Cantonese Wikipedia, so someone gets a chance to check the translation quality and fix it before uploading. 翹仔 (talk) 19 July 2022 (Tue) 08:31 (UTC)
- Or it could be put on draft pages of some sort, so everyone can help check it. ——Z423X5C6 (talk) 19 July 2022 (Tue) 08:51 (UTC)
- @Z423x5c6, Deryck Chan: Z's idea works for me. Dr. Greywolf (talk) 19 July 2022 (Tue) 09:09 (UTC)
- Update: trying to add a feature that automatically converts the navbox's article titles into Cantonese, and automatically asks the user for a translation whenever an article is stuck with an English title. Dr. Greywolf (talk) 19 July 2022 (Tue) 15:09 (UTC)
@Z423x5c6, Deryck Chan, H78c67c, Shinjiman: The program is actually already usable, but I also want it to convert the navbox's article titles into Cantonese. Dr. Greywolf (talk) 20 July 2022 (Wed) 15:25 (UTC)
- In that case you can just use the Cantonese title when you build the page object, i.e. pywikibot.Page(site, 'Cantonese title'). ——Z423X5C6 (talk) 21 July 2022 (Thu) 03:30 (UTC)
- @Z423x5c6: What I mean is I want the program to turn all the English titles inside the navbox code into Cantonese titles, so I can copy-and-paste the finished navbox onto the wiki in one go... Dr. Greywolf (talk) 21 July 2022 (Thu) 15:05 (UTC)
@Z423x5c6, Deryck Chan, H78c67c, Shinjiman: Are my comments clear enough? Dr. Greywolf (talk) 21 July 2022 (Thu) 15:35 (UTC)
@Deryck Chan: By the way, would this program be useful for wikis in other languages? Dr. Greywolf (talk) 21 July 2022 (Thu) 15:47 (UTC)
- Yes. Right now the Content Translate tool is all-or-nothing: either you don't translate, or you have to do the whole page at once. Being able to do just a paragraph or two would be good. 翹仔 (talk) 9 August 2022 (Tue) 09:05 (UTC)
- @Deryck Chan: Great, then feel free to pass it on to wikis in other languages. Dr. Greywolf (talk) 9 August 2022 (Tue) 09:14 (UTC)
- @Deryck Chan: Should we put this code on GitHub? Dr. Greywolf (talk) 9 August 2022 (Tue) 09:19 (UTC)
24-July-2022
Notes to self: planning to improve the program to the point where other Wikipedians can take it and use it. Dr. Greywolf (talk) 24 July 2022 (Sun) 08:48 (UTC)
The program now has a prototype. I want to add the following functionality:
- Teach the program to ask the user how to translate the titles of articles that only have English titles.
- Teach the program to do the machine translation automatically. Dr. Greywolf (talk) 24 July 2022 (Sun) 09:01 (UTC)
Just updated the code again; silly me, I should have used OOP from the start. :P Dr. Greywolf (talk) 25 July 2022 (Mon) 14:13 (UTC)
2-August-2022
@Deryck Chan, Z423x5c6, Shinjiman, Chaaak: I have another new idea. When I write articles, I often start by reading the English Wikipedia and Encyclopædia Britannica articles on the topic for reference. I'm thinking this process could actually be automated. Imagine a program that:
- takes a topic as input;
- searches for the corresponding English Wikipedia page, and tries the corresponding Encyclopædia Britannica or Stanford Encyclopedia of Philosophy page as well;
- for each of these pages, extracts the important keywords and content;
- assembles the content into a document for me to translate directly into the article. Dr. Greywolf (talk) 2 August 2022 (Tue) 14:19 (UTC)
Code pasted here; first revision. Dr. Greywolf (talk) 10 August 2022 (Wed) 02:22 (UTC)
- Updated. Dr. Greywolf (talk) 10 August 2022 (Wed) 15:03 (UTC)
# -*- coding: utf-8 -*-
"""
Created on Thu Aug 4 22:51:58 2022
@author: Ralph
"""
import re
import docx
from docx import Document
import time
from nltk.corpus import stopwords  # needs nltk.download('stopwords') on first run
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

top_n = 5  # limits the number of sentences you get
# temp_file_name = 'python_testing_autosum.txt'
doc_file_name = 'to_be_summarised.docx'
doc_opened = docx.Document(doc_file_name)
stop_words = set(stopwords.words('english'))
summarized_text = []

def sentence_similarity(sent1, sent2):
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]
    all_words = list(set(sent1 + sent2))  # all the words in either sent1 or sent2
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
    for w in sent1:
        if w in stop_words:
            continue
        vector1[all_words.index(w)] += 1
    for w in sent2:
        if w in stop_words:
            continue
        vector2[all_words.index(w)] += 1
    return 1 - cosine_distance(vector1, vector2)

def build_similarity_matrix(sentences):
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:  # skip comparing a sentence with itself
                continue
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2])
    print(similarity_matrix)  # for debugging...
    print('')
    return similarity_matrix

##### Read the input file...
filedata = []
for para in doc_opened.paragraphs:
    filedata.append(para.text)
# file = open(doc_file_name, "r")  # if you want to use a txt file as input...
# filedata = file.readlines()
print("Here is what I read from the input file...\n")
print(filedata)

article = filedata[0].split(". ")
sentences = []
for s in article:
    ss = re.sub("[^a-zA-Z]", " ", s).split(" ")  # keep letters only, then split into words
    sentences.append(ss)
print("\nFeeding the following into the sentence similarity calculation...\n")
print(sentences)

print('\nBuilding the matrix and calculating each sentence\'s score...\n')
sentence_similarity_matrix = build_similarity_matrix(sentences)
sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
scores = nx.pagerank(sentence_similarity_graph)
for idx in range(len(scores)):
    print('Sentence ' + str(idx) + ': ' + str(scores[idx]))
ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
for i in range(top_n):
    try:
        summarized_text.append(" ".join(ranked_sentence[i][1]))
    except IndexError:  # fewer than top_n sentences available
        print('Error on Sentence ' + str(i))
print("Summarized text: \n", ". ".join(summarized_text))

##### ##### Let's write everything to the output file...
outputfile = Document()
try:
    p = outputfile.add_paragraph(". ".join(summarized_text))
except:
    p = outputfile.add_paragraph(' ')
outputfile.save('wiki_autosummariser_output.docx')
print('\nThe program has finished running. Exiting in a few seconds...')
time.sleep(10)