User:Dr. Greywolf/Bot

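Original pseudocode for the bot: Dr. Greywolf (talk) 15:00, 27 June 2022 (UTC)

 Take the source code of an English Wikipedia navbox template;
 Store every piece of text wrapped in [[ ]] in an array;
 Foreach item in the array:
   Look up the English Wikipedia article that the text corresponds to;
   Go to the Chinese version of that article and take its Hong Kong variant title;
   Look up the Jyutping pronunciation of the Chinese-character title on 粵典;
   Take the first 5 sentences of the English Wikipedia article;
   Machine-translate those sentences with Microsoft's translator; ([1])
   Use the information above to create a stub for the article;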

28-June-2022

@Z423x5c6, Deryck Chan, H78c67c, Shinjiman: I have just written a finite-state machine:

Input - the wiki code of a navbox template;
Output - an array storing the titles of the articles;

~~I have just tested it on English Wikipedia's en:Template:Natural language processing, and as expected the code gives the following output:~~

Dr. Greywolf (talk) 14:53, 27 June 2022 (UTC)
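
The finite-state machine described above can be sketched roughly as follows. This is a minimal illustration, not the original code: the function name `extract_link_targets` and the sample wikitext are made up for the example, and skipping File: and Category: links is an assumption about what a navbox scan should ignore.

```python
def extract_link_targets(wikitext):
    """Return the targets of [[...]] links in order, without duplicates."""
    targets = []
    store = ""
    state = 0  # number of consecutive '[' seen so far, capped at 2

    for c in wikitext:
        if c == "[":
            state = min(state + 1, 2)
        elif c in "]|":
            # a closing bracket or a pipe ends the link target
            if state == 2 and store and not store.startswith(("File:", "Category:")):
                targets.append(store)
            store = ""
            state = 0
        elif state == 2:
            store += c  # inside [[ ... ]]: collect the target
        else:
            state = 0  # a lone '[' (an external link) is not a wikilink

    return list(dict.fromkeys(targets))


# Hypothetical sample input
sample = ("{{Navbox|list1=[[Music theory]] [[Tuning|Musical tuning]] "
          "[http://x.org site] [[File:A.png|thumb]]}}")
print(extract_link_targets(sample))
```

Only the part before a pipe is kept, so `[[Tuning|Musical tuning]]` yields the target `Tuning` rather than the display text.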

30-June-2022

@Z423x5c6, Deryck Chan, H78c67c, Shinjiman: I have just added the following code:

Input - an array storing the English titles of the articles;
Output - an array storing the expected Cantonese titles of the articles;

...

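The code above takes an array as input, listing the titles of the English Wikipedia articles to look up, and returns a new array listing the expected Cantonese Wikipedia titles of those articles:

I used the wiki code of English Wikipedia's music theory navbox as input;
The output, as expected, is ['樂理', '音樂美學', '音樂分析', '音樂元素', '作曲', '音樂嘅定義', 'Music and mathematics', '音樂學', '音樂哲學', '音樂心理學', '音樂集合論', '調音', 'list of music theorists'];

Dr. Greywolf (talk) 16:16, 29 June 2022 (UTC)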
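
The title lookup can be pictured as a language fallback: use the Cantonese label if there is one, otherwise a Traditional Chinese one, otherwise keep the English title. The sketch below assumes the labels are already available as a plain dictionary keyed by language code; `pick_title` and both sample dictionaries are made up for illustration and are not taken from Wikidata.

```python
def pick_title(labels, languages=("yue", "zh-hant", "en")):
    """Return (title, language) for the first language that has a label."""
    for lang in languages:
        title = labels.get(lang)
        if title:
            return title, lang
    return "N/A", None  # nothing usable was found


# Hypothetical label sets
labels_a = {"en": "Music theory", "zh-hant": "音樂理論", "yue": "樂理"}
labels_b = {"en": "Music and mathematics"}

print(pick_title(labels_a))
print(pick_title(labels_b))
```

Returning the language alongside the title lets the caller tell which articles fell back to English and therefore still need a manual translation.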

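Added some try blocks so that red links in the template no longer stop the program from running. Dr. Greywolf (talk) 07:17, 30 June 2022 (UTC)

Plan for the next step:

Teach the program to take the first 5 sentences of each English Wikipedia article; Dr. Greywolf (talk) 07:19, 30 June 2022 (UTC)
Teaching the program to look up Jyutping is not essential, so it is postponed until the rest of the program is finished. Dr. Greywolf (talk) 07:19, 30 June 2022 (UTC)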

2-July-2022

@Z423x5c6, Deryck Chan, H78c67c, Shinjiman: Update:

Input - an array storing the English titles of the articles;
Output - for each article, the opening part of its content;

...

Tested it with English Wikipedia's music theory navbox; it is mostly working correctly. Dr. Greywolf (talk) 15:48, 1 July 2022 (UTC)

Plan for the next step: machine translation. Dr. Greywolf (talk) 15:46, 1 July 2022 (UTC)

@Z423x5c6, Deryck Chan, H78c67c, Shinjiman: Would people on other wikis be interested in this tool? Dr. Greywolf (talk) 16:04, 1 July 2022 (UTC)

I am wondering whether the part on line 14 above (if '| ' in storaga or 'File:' in storaga ...) should be replaced with an artificial neural network. Dr. Greywolf (talk) 01:52, 2 July 2022 (UTC)

Actually, you can use mw:Extension:TextExtracts#API (e.g. https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exchars=500&explaintext=1&titles=Artificial+intelligence ) to get article extracts. H78c67c·talk 02:07, 2 July 2022 (UTC)

@H78c67c: OK, thank you. Silly me. XD Dr. Greywolf (talk) 02:15, 2 July 2022 (UTC)
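
A request like the one H78c67c suggests could be built and read roughly as below. This is only a sketch: the helper names are hypothetical, `format=json` is an added assumption (it is not in the example URL), and the sample response is hand-written in the shape the API is assumed to return rather than fetched from the server.

```python
import json
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title, chars=500):
    """Build a TextExtracts query URL for one article title."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exchars": chars,
        "explaintext": 1,
        "titles": title,
        "format": "json",  # assumption: ask for a JSON response
    }
    return API_URL + "?" + urlencode(params)


def extract_from_response(data):
    """Return the first extract found in a parsed response, or ''."""
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        if "extract" in page:
            return page["extract"]
    return ""


print(build_extract_url("Artificial intelligence"))

# Hand-written response, not real API output
sample_response = json.loads(
    '{"query": {"pages": {"1": {"title": "X", "extract": "X is ..."}}}}'
)
print(extract_from_response(sample_response))
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the decoded JSON to `extract_from_response` would replace the hand-written line filter with plain-text extracts.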

3-July-2022

@Z423x5c6, Deryck Chan, H78c67c, Shinjiman: Update:

Input - an array storing the opening text of the articles;
Output - the same array, with essentials such as a "See also" (睇埋) section added to each entry;

...

Plan for the next step: teach the program to create the articles and paste the content in. Dr. Greywolf (talk) 15:52, 2 July 2022 (UTC)
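
The step that adds the "See also" section and other essentials could look roughly like this. It is a sketch under assumptions: `wrap_stub` is a hypothetical helper, the template and heading names are the ones used in the full script further down this page, and using the first topic of the navbox as both the "See also" entry and the category is one possible choice.

```python
def wrap_stub(intro, main_topic):
    """Surround an article's opening text with stub boilerplate."""
    parts = [
        "{{translation|英維|tpercent=0}}",
        intro,
        "==睇埋==",                         # "See also" section
        "*[[" + main_topic + "]]",
        "==攷==",                           # references section
        "{{reflist}}",
        "{{stub}}",
        "[[Category:" + main_topic + "]]",
    ]
    return "\n".join(parts)


print(wrap_stub("Music analysis is ...", "樂理"))
```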

10-July-2022

Full code goes here

15-July-2022 Update Dr. Greywolf (talk) 13:30, 15 July 2022 (UTC)

Updated on 22-July-2022 Dr. Greywolf (talk) 07:34, 22 July 2022 (UTC)

Final update on 26-July-2022 Dr. Greywolf (talk) 02:32, 26 July 2022 (UTC)

"""
Created on Sat Jun 25 00:31:59 2022
@author: No one
"""
import pywikibot
import re
import docx
from docx import Document
import time

##### Declare the Article class
class Article:
    def __init__(self, eng_tit):
        self.eng_title = eng_tit
        self.canto_title = ''
        self.content = ''
        self.DoesCantoWikiHaveIt = False # This will be true if the Cantonese wiki has this article...
        self.TitleNeedsManualTranslated = False # This will be true if no Chinese-character title is available...

state = 0
# state basically represents the number of '[' read,
# state is used only when reading the input file...

filename = "python_testing.txt"
# The line above names the txt file that holds the input wiki code

topic_list = []
store = ""

# Keep a copy of the whole navbox; its link targets are rewritten further down
with open(filename, "r", encoding="utf-8") as text_file:
    navbox = text_file.read()

##### ##### Read the input and create the array...
with open(filename, "r", encoding="utf-8") as f:
  while True:
    c = f.read(1)
    if not c:
      # print(topic_list)
      break

    if c == '[':
      # count opening brackets; state == 2 means we are inside [[ ... ]]
      if state < 2:
        state += 1
    elif c == ']' or c == '|':
      # a closing bracket or a pipe ends the link target
      state = 0
      if store != "" and "File:" not in store and "Category:" not in store:
        topic_list.append(store)
      store = ""
    elif state == 2:
      store += c
    else:
      state = 0 # a lone '[' (e.g. an external link) is not a wikilink

##### Remove duplicates, just in case...
topic_list = list(dict.fromkeys(topic_list))

##### Create a list of article objects...
article_list = []

print('Here are the article titles read from the file:')    
for x in topic_list:
    article_list.append(Article(x))

for x in article_list:
    print(x.eng_title)
      
##### ##### Create an array to store the articles' Cantonese titles      
canto_title_list = []
     
for art in article_list: # article_list holds one Article per entry of topic_list
    canto_title = 'N/A'
    try:
        site = pywikibot.Site()
        page = pywikibot.Page(site, art.eng_title)
        item = pywikibot.ItemPage.fromPage(page)
        # print(item) # This line shows the item's Wikidata Q address
        labels = item.get()["labels"]

        # Prefer the Cantonese label, then Traditional Chinese, then English
        for lang in ('yue', 'zh-hant', 'en'):
            try:
                canto_title = labels[lang]
            except KeyError:
                continue
            if lang == 'yue':
                # a Cantonese label is taken to mean the Cantonese wiki has the article
                art.DoesCantoWikiHaveIt = True
            elif lang == 'en':
                ## If an article does not have a Chinese-character title to use,
                ## it needs manual translation...
                art.TitleNeedsManualTranslated = True
            break
    except Exception:
        canto_title = 'N/A' # red link, no Wikidata item, network error, ...

    canto_title_list.append(canto_title)
    art.canto_title = canto_title ## Put the title we found into canto_title...
       
print('\nJust attempted finding a Cantonese title for each article...')
print('The following articles are not present in the Cantonese wiki...')
print('Though they may be present in the Chinese wiki...')
for x in article_list:
    if not x.DoesCantoWikiHaveIt:
        print(x.eng_title + ', ' + x.canto_title)

print('\nSome articles need a manually translated title...')
for x in article_list:
    if x.TitleNeedsManualTranslated:
        print(x.eng_title)
 
##### Let's alter the navbox accordingly...
for x in article_list:    
    xplus = '[[' + x.eng_title + '|'
    xplusb = '[[' + x.eng_title + ']'
    
    cantoplus = '[[' + x.canto_title + '|'
    cantoplusb = '[[' + x.canto_title + ']'
    
    if not x.DoesCantoWikiHaveIt and not x.TitleNeedsManualTranslated:
        # keep the English title in a comment next to the new link
        cantoplus = '<!--' + x.eng_title + '-->' + '[[' + x.canto_title + '|'
        cantoplusb = '<!--' + x.eng_title + '-->' + '[[' + x.canto_title + ']'

    # str.replace never raises, so no try/except is needed here
    navbox = navbox.replace(xplus, cantoplus)
    navbox = navbox.replace(xplusb, cantoplusb)
    
##### End of fixing the navbox...

##### ##### For each article listed in topic_list, get its first xx words.
for art in article_list:
    try:
        site = pywikibot.Site()
        page = pywikibot.Page(site, art.eng_title)
        en_page_text = page.get()

        for line in en_page_text.split('\n'):
            storaga = line.strip() # storaga is "storage" plus "firaga".
            if storaga == '':
                continue
            # skip template parameters, images, links, tables and templates
            if '| ' in storaga or 'File:' in storaga or storaga[0] in '[|{':
                continue
            if len(storaga) > 100: # take the line only if it is long enough...
                art.content = storaga
                break
    except Exception:
        print('error - no article found, or something, for ' + art.eng_title)
        
## for x in article_list: ## This for loop is again for debugging purposes
##    print(x.content)

##### ##### For each piece of stored content, add the essential stuffs...
for x in article_list:
    texti_temp = '{{translation|英維|tpercent=0}} \n' + x.content + '\n' + '<!--（{{jpingauto|}}；{{lang-en|}}）。\n== 註釋 ==\n{{reflist|group=註}}-->'
    x.content = texti_temp
    x.content += '\n==睇埋==\n*[[' + canto_title_list[0] + ']]\n*[[]]\n*[[]]\n==攷==\n{{reflist}}\n{{stub}}\n[[Category:' + canto_title_list[0] + ']]'

##### ##### Let's write everything to the output file...
outputfile = Document()

## Put everything into the docx file..
try:
    p = outputfile.add_paragraph(navbox)
except Exception: # e.g. characters that cannot be stored in a docx file
    p = outputfile.add_paragraph(' ')

p = outputfile.add_paragraph('\n\n\n')

for x in article_list:
    if not x.DoesCantoWikiHaveIt:
        try:
            p = outputfile.add_paragraph(x.content)
        except Exception:
            p = outputfile.add_paragraph(' ')
        p = outputfile.add_paragraph('\n\n\n')

outputfile.save('wiki_autonavbox_output.docx')

print('\nThe program has finished running. Exiting in a few seconds...')
time.sleep(10)

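@Z423x5c6, Deryck Chan, H78c67c, Shinjiman: I am stuck on the part where the program actually edits the wiki. Any suggestions?

@Z423x5c6, Deryck Chan, H78c67c, Shinjiman: Just updated it. Since I have not yet worked out how to make the program edit the wiki, for now it saves all the text to be added to the wiki into a docx file. Dr. Greywolf (talk) 13:30, 15 July 2022 (UTC)

It would be good to have a pause between generating the machine translation and posting to Cantonese Wikipedia anyway, so that a human can check the translation quality and fix it before uploading. 翹仔 (talk) 08:31, 19 July 2022 (UTC)

Or it could be put on some draft pages, so that everyone can help check. Z423X5C6 (talk) 08:51, 19 July 2022 (UTC)

@Z423x5c6, Deryck Chan: Z's idea sounds OK. Dr. Greywolf (talk) 09:09, 19 July 2022 (UTC)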
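
Saving to draft pages for review might be sketched as follows. Everything here is an assumption for illustration: the helper names, the "/Drafts/" subpage layout, the "zh-yue" site code and the edit summary are hypothetical, and `publish_drafts` needs a configured pywikibot login to do anything. Only `page.text` and `page.save(summary=...)` are the pywikibot calls relied on.

```python
def draft_title(user, article_title):
    """Build a hypothetical user-space draft title for review."""
    return "User:" + user + "/Drafts/" + article_title


def save_draft(page, content, summary):
    """Write content to a page-like object and save it."""
    page.text = content
    page.save(summary=summary)


def publish_drafts(articles, user):
    """articles: iterable of (title, content) pairs."""
    import pywikibot  # imported lazily so the helpers above work without it

    site = pywikibot.Site("zh-yue", "wikipedia")  # assumption: Cantonese Wikipedia
    for title, content in articles:
        page = pywikibot.Page(site, draft_title(user, title))
        save_draft(page, content, "Bot: machine-translated draft for review")


print(draft_title("Example", "樂理"))
```

Keeping the drafts in user space leaves the human check in place before anything reaches the article namespace.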

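Update: I am trying to add a feature that makes the program automatically replace the article titles in the navbox with Cantonese ones and, when an article is stuck with an English title, ask the user how to translate it. Dr. Greywolf (talk) 15:09, 19 July 2022 (UTC)

@Z423x5c6, Deryck Chan, H78c67c, Shinjiman: The program is actually already usable, but I would also like it to replace the article titles in the navbox with Cantonese ones. Dr. Greywolf (talk) 15:25, 20 July 2022 (UTC)

In that case, just use the Cantonese title when you create the page object, like pywikibot.Page(site, '粵文title'), where 粵文title stands for the Cantonese title. Z423X5C6 (talk) 03:30, 21 July 2022 (UTC)

@Z423x5c6: What I mean is that I want the program to turn all the English titles in the navbox code into Cantonese titles, so that I can copy and paste the whole navbox into the wiki in one go... Dr. Greywolf (talk) 15:05, 21 July 2022 (UTC)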
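
Rewriting the link targets inside the navbox code could be done with one regular-expression pass, as in this rough sketch. `retarget_links`, the sample navbox line and the title mapping are made up for the example; it swaps in a regex where the full script uses plain string replacement.

```python
import re


def retarget_links(navbox, title_map):
    """Replace [[English title...]] link targets using title_map."""
    def swap(match):
        target = match.group(1)
        # leave the link alone if no Cantonese title is known
        return "[[" + title_map.get(target, target)

    # match '[[' followed by the link target, up to '|' or ']'
    return re.sub(r"\[\[([^\]|]+)", swap, navbox)


# Hypothetical sample input
navbox = "* [[Music theory]] * [[Tuning|Musical tuning]] * [[Music and mathematics]]"
print(retarget_links(navbox, {"Music theory": "樂理", "Tuning": "調音"}))
```

Only the target before the pipe changes, so any English display text after a pipe still has to be translated by hand.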

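@Z423x5c6, Deryck Chan, H78c67c, Shinjiman: Are my comments clear enough? Dr. Greywolf (talk) 15:35, 21 July 2022 (UTC)

@Deryck Chan: By the way, would this program be useful to Wikipedias in other languages? Dr. Greywolf (talk) 15:47, 21 July 2022 (UTC)

It would. The current Content Translate tool is all-or-nothing: either you do not translate, or you have to do the whole page. Being able to do just a paragraph or two is quite nice. 翹仔 (talk) 09:05, 9 August 2022 (UTC)

@Deryck Chan: OK, then feel free to take it to Wikipedias in other languages. Dr. Greywolf (talk) 09:14, 9 August 2022 (UTC)

@Deryck Chan: Should I put this code on GitHub? Dr. Greywolf (talk) 09:19, 9 August 2022 (UTC)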

24-July-2022

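Notes to self: I plan to improve the program to the point where other Wikipedians can use it. Dr. Greywolf (talk) 08:48, 24 July 2022 (UTC)

The program already has a working prototype. I would like to add the following functionality:

Teach the program to ask the user how to translate the titles of articles that only have an English Wikipedia title.
~~Teach the program to do machine translation automatically.~~ Dr. Greywolf (talk) 09:01, 24 July 2022 (UTC)

Just updated the code again. Silly me. :P I should have used OOP from the start. Dr. Greywolf (talk) 14:13, 25 July 2022 (UTC)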

2-August-2022

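@Deryck Chan, Z423x5c6, Shinjiman, Chaaak: I have another new idea. When I write articles, I often start by reading the corresponding English Wikipedia and Encyclopædia Britannica articles for reference. I think this process could actually be automated. Imagine a program like this:

Take a topic as input;
Search for the corresponding English Wikipedia page, and also try the corresponding Encyclopædia Britannica or Stanford Encyclopedia of Philosophy page;
For each of these pages, extract the important keywords and content on the page.
Assemble the content into a document that I can translate directly into the article. Dr. Greywolf (talk) 14:19, 2 August 2022 (UTC)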
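
The keyword and assembly steps of this plan could be sketched as below. It is a deliberately simple illustration: it assumes the page texts have already been fetched, uses plain word frequency as a stand-in for "important keywords", and `STOP_WORDS`, `top_keywords` and `build_notes` are all hypothetical names.

```python
import re
from collections import Counter

# A tiny hand-picked stop-word list, for illustration only
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "is", "are", "to",
              "that", "it", "as", "by", "for", "with", "on", "has"}


def top_keywords(text, n=5):
    """Return the n most frequent non-stop-words in text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [w for w, _ in counts.most_common(n)]


def build_notes(topic, sources):
    """sources maps a source name to the text already fetched from it."""
    lines = ["Topic: " + topic]
    for name, text in sources.items():
        lines.append(name + ": " + ", ".join(top_keywords(text)))
    return "\n".join(lines)


print(build_notes("Music", {"enwiki": "music music theory"}))
```

A sentence-ranking summariser, like the script posted below, could then supply the content part that sits next to these keywords.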

Code posted here, first update. Dr. Greywolf (talk) 02:22, 10 August 2022 (UTC)

@Deryck Chan, Z423x5c6, Shinjiman, Chaaak:

Updated Dr. Greywolf (talk) 15:03, 10 August 2022 (UTC)

# -*- coding: utf-8 -*-
"""
Created on Thu Aug  4 22:51:58 2022

@author: Ralph
"""
import docx
from docx import Document
import time

from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

top_n = 5 # This parameter would limit the number of sentences you get...

# temp_file_name = 'python_testing_autosum.txt'
doc_file_name = 'to_be_summarised.docx'
doc_opened = docx.Document(doc_file_name)

stop_words = set(stopwords.words('english'))
summarized_text = []

def sentence_similarity(sent1, sent2):
    sent1 = [w.lower() for w in sent1]
    # print(sent1) # for debugging...
    sent2 = [w.lower() for w in sent2]
    # print(sent2) # for debugging...
    all_words = list(set(sent1 + sent2)) # list out all the words in either sent1 or sent2
    # print(all_words) # for debugging...

    # Count how often each non-stop-word occurs in each sentence
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)

    for w in sent1:
        if w in stop_words:
            continue
        vector1[all_words.index(w)] += 1

    for w in sent2:
        if w in stop_words:
            continue
        vector2[all_words.index(w)] += 1

    # A sentence made up only of stop words gives an all-zero vector,
    # for which cosine distance is undefined; treat it as "no similarity".
    if not any(vector1) or not any(vector2):
        return 0.0

    return 1 - cosine_distance(vector1, vector2)

def build_similarity_matrix(sentences):
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
    # print(similarity_matrix) # for debugging...
    
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2: #ignore if both are same sentences
                continue
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2])
            
    print(similarity_matrix) # for debugging...
    print('')
    return similarity_matrix

##### Read the input file...
filedata = []
for para in doc_opened.paragraphs:
    filedata.append(para.text)

#file = open(doc_file_name, "r") # if you want to use a txt file as input...
#filedata = file.readlines() #

print("Here is what I read from the docx file...\n")
print(filedata)

article = filedata[0].split(". ") # only the first paragraph of the file is summarised
# print(article) # for debugging...

sentences = []

for s in article:
    ss = s.split() # split on whitespace; punctuation and digits stay attached to the words
    # print(ss) # for debugging...
    if ss: # skip empty sentences
        sentences.append(ss)

print("\nFeeding the following into sentence similarity calculation...\n")
print(sentences)

# print('\nTesting the sentence_similarity function...\n')
# print(sentence_similarity(sentences[0], sentences[1])) # This line is only for testing...

print('\nBuilding the matrix and calculating each sentence\'s score...\n')
sentence_similarity_matrix = build_similarity_matrix(sentences)

sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
scores = nx.pagerank(sentence_similarity_graph)

for idx in range(len(scores)):
    print('Sentence ' + str(idx) + ': ' + str(scores[idx]))

ranked_sentence = sorted(((scores[i],s) for i,s in enumerate(sentences)), key=lambda pair: pair[0], reverse=True)
# print("Indexes of top ranked_sentence order are ", ranked_sentence)

# Slicing copes with texts that have fewer than top_n sentences
for score, words in ranked_sentence[:top_n]:
    summarized_text.append(" ".join(words))

print("Summarized text: \n", ". ".join(summarized_text))

##### ##### Let's write everything to the output file...
outputfile = Document()

try:
    p = outputfile.add_paragraph(". ".join(summarized_text))
except Exception: # e.g. characters that cannot be stored in a docx file
    p = outputfile.add_paragraph(' ')
    
outputfile.save('wiki_autosummariser_output.docx')
    
print('\nThe program has finished running. Exiting in a few seconds...')
time.sleep(10)