fredreload Posted February 24, 2021

Description: I wrote this AI after I left my previous company, Pou Chen. I did some web scraping with word-frequency testing at the company, and I came up with this AI after I left. Below are the videos; the Python scripts are included in the videos, along with a link to Crunchyroll (posted by me) explaining their content.

You can run the scripts and test them out to see what I can improve on. Yes, the scripts are crude with no comments, but I spent 6 months (non-continuous) working on them. So before you say they don't make sense or don't work, take some time to get used to the scripts, and I will try to answer questions here. Once you understand the program, you can also test the scripts with a different dictionary.

Part 1: [video]

Part 2: [video]
Ghideon Posted February 24, 2021

1 hour ago, fredreload said: the scripts are crude with no comments

Ok.

1 hour ago, fredreload said: see what I can improve on

Suggestion 1: Add comments and documentation to your scripts.

Note that Python 2.7 has reached end of life*; migrate to 3.x.

*) https://www.python.org/doc/sunset-python-2/
fredreload Posted February 25, 2021

17 hours ago, Ghideon said: Suggestion 1: Add comments and documentation to your scripts.

That would be a good idea; I will add them in my free time.

Quote: Note that Python 2.7 has reached end of life*; migrate to 3.x.

My scripts technically also run on 64-bit Python 3.x, with perhaps some changes to the print statements.
fredreload Posted March 1, 2021

I modified the last part of the script; it is now more responsive and complete.
fredreload Posted March 1, 2021

This Python script is too slow. I think I need to rewrite the program in C#; do you think that would give it a performance boost, Ghideon?
Ghideon Posted March 1, 2021

15 minutes ago, fredreload said: This Python script is too slow. I think I need to rewrite the program in C#; do you think that would give it a performance boost, Ghideon?

The information provided is too limited; I will not try to make a prediction. In my experience performance is the result of many parameters; using another implementation language may or may not give the required performance boost.
fredreload Posted March 1, 2021

2 hours ago, Ghideon said: The information provided is too limited; I will not try to make a prediction. In my experience performance is the result of many parameters; using another implementation language may or may not give the required performance boost.

I agree. I am unable to find a suitable disk-based dictionary for C# = =; I am using Python's shelve module for this program. So I might have to use SQLite if I want to integrate it into a C# platform, but SQLite is still pretty slow.
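[Editor's note: for readers unfamiliar with shelve, a minimal sketch of the kind of disk-based dictionary the script relies on. The file name "freq_db" and the phrase are invented for illustration.]

```python
import shelve

# shelve gives a persistent, dict-like object backed by files on disk --
# this is the "disk-based dictionary" referred to above.
db = shelve.open("freq_db")
phrase = "the quick brown"
db[phrase] = db.get(phrase, 0) + 1   # upsert-style frequency bump
count = db[phrase]
db.close()
print(count)
```

Keys behave like ordinary dict keys, but values are pickled to disk, which is why a multi-GB shelf can outlive a 16 GB RAM budget.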
Ghideon Posted March 2, 2021

21 hours ago, fredreload said: I agree. I am unable to find a suitable disk-based dictionary for C#; I am using shelve for this program. So I might have to use SQLite if I want to integrate it into a C# platform, but SQLite is still pretty slow.

The information provided so far is not enough to comment on what kind of performance issue you are facing; I do not have an opinion on whether specific products are suitable or not.
fredreload Posted March 3, 2021

20 hours ago, Ghideon said: The information provided so far is not enough to comment on what kind of performance issue you are facing; I do not have an opinion on whether specific products are suitable or not.

As you can see, my shelve db files are a few GB in size. Each dictionary entry maps a phrase to a frequency, and I constantly update these frequencies. So for SQLite I would have about 1 billion upsert statements. I tried running 1 billion inserts, which is fast, but updating the frequencies takes forever. That was with "if exists, update; else, insert" logic. I never tried an actual upsert, so I do not know whether it is faster or slower.
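[Editor's note: SQLite has had a native upsert (ON CONFLICT ... DO UPDATE) since version 3.24, and batching statements into one transaction usually matters more than insert-vs-update. A minimal sketch; the table name "phrase_freq" and the phrases are invented, and ":memory:" is used only to keep the example self-contained — a real run would open a file.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real run would use a db file
conn.execute("""
    CREATE TABLE phrase_freq (
        phrase TEXT PRIMARY KEY,
        freq   INTEGER NOT NULL DEFAULT 1
    )
""")

def bump(phrases):
    # One transaction per batch: per-statement commits are usually what
    # makes a billion updates "take forever".
    with conn:
        conn.executemany(
            "INSERT INTO phrase_freq (phrase, freq) VALUES (?, 1) "
            "ON CONFLICT(phrase) DO UPDATE SET freq = freq + 1",
            [(p,) for p in phrases],
        )

bump(["the quick brown", "quick brown fox", "the quick brown"])
print(conn.execute("SELECT freq FROM phrase_freq WHERE phrase = ?",
                   ("the quick brown",)).fetchone()[0])  # -> 2
```

Because "phrase" is the primary key, the conflict check is an index lookup rather than a table scan.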
Prof Reza Sanaye Posted March 3, 2021

The ongoing flow of information makes memory distinguishers impossible the more we fracture time into Nano- and Femto-seconds . ... . . . . .Background (echo) vibration is thus the distinguishing motor . . . . . .Because memory signals ought to go on a differing time-scale vibration . . . .
Ghideon Posted March 3, 2021

3 minutes ago, Prof Reza Sanaye said: The ongoing flow of information makes memory distinguishers impossible the more we fracture time into Nano- and Femto-seconds . ... . . . . .Background (echo) vibration is thus the distinguishing motor . . . . . .Because memory signals ought to go on a differing time-scale vibration . . . .

And how is that related to the discussion in this thread?
Prof Reza Sanaye Posted March 3, 2021

3 minutes ago, Ghideon said: And how is that related to the discussion in this thread?

By turning the temporal Script (fractals) all but continual . . . . .
Ghideon Posted March 3, 2021

2 minutes ago, Prof Reza Sanaye said: By turning the temporal Script (fractals) all but continual . . . . .

Your text looks like the output of some algorithm, based on Markov chains(?), that generates random sentences.

4 hours ago, fredreload said: As you can see, my shelve db files are a few GB in size. Each dictionary entry maps a phrase to a frequency, and I constantly update these frequencies. So for SQLite I would have about 1 billion upsert statements. I tried running 1 billion inserts, which is fast, but updating the frequencies takes forever.

This seems to require more time and effort than I am willing to provide at this time.
Sensei Posted March 4, 2021

On 3/3/2021 at 6:05 PM, fredreload said: As you can see, my shelve db files are a few GB in size.

Do you have an SSD? Do you have NVMe? What is the data transfer rate during db access? How many GB of memory does your computer have? Try using a virtual disk in memory (RAM disk) to see whether the speed changes.

Quote: Each dictionary entry maps a phrase to a frequency, and I constantly update these frequencies. So for SQLite I would have about 1 billion upsert statements. I tried running 1 billion inserts, which is fast, but updating the frequencies takes forever. That was with "if exists, update; else, insert" logic. I never tried an actual upsert, so I do not know whether it is faster or slower.

How are you storing, querying and updating the db? Show the SQL query strings for all of them.

You can try:
- Calculate an MD5 (or similar) hash of the phrase text first, to use as a hash code.
- Phrase table: use that hash as a unique key together with each phrase text.
- Frequency table: use the same hash code in a second table with the quantities/frequencies as integers.

The update should then be faster, since it won't require adding or replacing an entire string.

Alternatively, don't store phrases as plain text. Keep a dictionary of words with unique indices; a 4-byte integer is enough for 4.2 billion words. Then make phrase tables: one with two columns for word indices, a second with three columns for word indices, and so on as you add more in the future.
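[Editor's note: a rough sketch of the integer-index scheme described above, assuming SQLite 3.24+ for the native upsert. All table and column names are invented, and ":memory:" stands in for a real db file.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real run would use a file
conn.executescript("""
    CREATE TABLE word (
        id   INTEGER PRIMARY KEY,      -- 4-byte rowid range is plenty
        text TEXT UNIQUE NOT NULL
    );
    -- one table per phrase length; this one holds 3-word phrases
    CREATE TABLE phrase3 (
        w1 INTEGER, w2 INTEGER, w3 INTEGER,
        freq INTEGER NOT NULL DEFAULT 1,
        PRIMARY KEY (w1, w2, w3)
    ) WITHOUT ROWID;
""")

def word_id(w):
    # Map each distinct word to a small integer index.
    conn.execute("INSERT OR IGNORE INTO word (text) VALUES (?)", (w,))
    return conn.execute("SELECT id FROM word WHERE text = ?", (w,)).fetchone()[0]

def bump(trigram):
    # Upsert the frequency of a 3-word phrase by its word indices.
    ids = tuple(word_id(w) for w in trigram)
    conn.execute(
        "INSERT INTO phrase3 (w1, w2, w3) VALUES (?, ?, ?) "
        "ON CONFLICT(w1, w2, w3) DO UPDATE SET freq = freq + 1", ids)

bump(("the", "quick", "brown"))
bump(("the", "quick", "brown"))
```

Because (w1, w2, w3) is the primary key, each update touches a fixed-width integer row via an index search instead of rewriting a string.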
fredreload Posted March 5, 2021

20 hours ago, Sensei said: Alternatively, don't store phrases as plain text. Keep a dictionary of words with unique indices; a 4-byte integer is enough for 4.2 billion words. Then make phrase tables: one with two columns for word indices, a second with three columns for word indices, and so on as you add more in the future.

That is a really good idea. First I would index all the words with a list of unique IDs. But the problem I am facing is that SQL is not a dictionary. "select * from db where string='123'" has a much slower run time than a dictionary lookup like db["123"]. Is there a way to combine this aspect of a dictionary with the SQL database?
Sensei Posted March 5, 2021

24 minutes ago, fredreload said: That is a really good idea. First I would index all the words with a list of unique IDs. But the problem I am facing is that SQL is not a dictionary. "select * from db where string='123'" has a much slower run time than a dictionary lookup like db["123"]. Is there a way to combine this aspect of a dictionary with the SQL database?

Make a cache in memory. Check whether the word is present in a dynamically allocated array or a key-value associative array; if it is, increase the entry's usage counter. If it is not present, look it up in the database and put the new entry in the cache. Keep the 1,000 or 10,000 or so most-used entries, and from time to time flush the least-used entries from the cache. The most frequently used words/phrases will then stay cached during the whole execution of the script.

You can make separate caches for single words, two-word phrases and three-word phrases, each with a user-configurable maximum number of entries. In an OOP language you would just make a cache class that wraps the entire database code.
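[Editor's note: one way to sketch this in Python — a bounded write-back cache in front of the on-disk store that accumulates counts in RAM and flushes the least-recently-used entries when full. All names here are invented, and a plain dict stands in for the shelve db.]

```python
from collections import OrderedDict

class FreqCache:
    """Bounded write-back cache in front of a slow on-disk store.

    `store` is any dict-like object (e.g. a shelve shelf)."""

    def __init__(self, store, max_entries=10_000):
        self.store, self.max = store, max_entries
        self.cache = OrderedDict()   # phrase -> pending increment, LRU order

    def bump(self, phrase):
        # Move the phrase to the most-recently-used end and count it.
        self.cache[phrase] = self.cache.pop(phrase, 0) + 1
        if len(self.cache) > self.max:
            self._flush(self.max // 2)   # evict the least-recently-used half

    def _flush(self, n):
        # Write the n least-recently-used pending counts to the store.
        for _ in range(n):
            phrase, delta = self.cache.popitem(last=False)
            self.store[phrase] = self.store.get(phrase, 0) + delta

    def close(self):
        self._flush(len(self.cache))

store = {}                               # stand-in for a shelve db
c = FreqCache(store, max_entries=4)
for p in ["a b c", "a b c", "d e f", "g h i", "j k l", "m n o"]:
    c.bump(p)
c.close()
print(store["a b c"])  # -> 2
```

Because the cache is bounded, memory use stays fixed no matter how large the corpus is; hot phrases are counted in RAM and only hit the disk when evicted.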
fredreload Posted March 5, 2021

1 hour ago, Sensei said: Make a cache in memory. Check whether the word is present in a dynamically allocated array or a key-value associative array; if it is, increase the entry's usage counter. If it is not present, look it up in the database and put the new entry in the cache.

When I cached everything in memory it used up all 16 GB of RAM; that is actually the first thing I tried, but my computer cannot handle that big a cache (I only have 16 GB of RAM), so I switched to a disk-based dictionary. The point of a dictionary is that every key is hashed, so the lookup time for any particular key is O(1). I don't know whether a database could be combined with a dictionary-based method to optimize the run time.
Sensei Posted March 5, 2021

Caching thousands of the most-used words in ASCII would take a few dozen kilobytes. Not MB. Not GB. KB. In Unicode, 2-4x more. You are caching so that you don't have to look up things like "I", "you", "it", etc.
fredreload Posted March 6, 2021

19 hours ago, Sensei said: Caching thousands of the most-used words in ASCII would take a few dozen kilobytes. Not MB. Not GB. KB. In Unicode, 2-4x more. You are caching so that you don't have to look up things like "I", "you", "it", etc.

Yes, but I am not caching individual words; I am caching phrases in groups of 3.