
Artificial Intelligence with n-grams (Python 2.7 (64-bit) Scripts)


fredreload

Recommended Posts

Description:

I wrote this AI after I left my previous company, Pou Chen. I did a little bit of web scraping with word-frequency testing at the company, which is how I came up with this AI after leaving. Below are the videos; the Python scripts are included in the videos, as well as a link to Crunchyroll, posted by me, explaining their content.

You can:

Run the scripts and test them out for me to see what I can improve on. Yes, the scripts are crude with no comments, but I spent six months (non-continuous) working on them. So before you say they don't make sense or don't work, please take some time to get used to the scripts, and I will try to answer questions here. You can also test the scripts with a different dictionary once you understand the program.

Part 1:

Part 2:

 


1 hour ago, fredreload said:

the scripts are crude with no comments

Ok. 

1 hour ago, fredreload said:

see what I can improve on

Suggestion 1: Add comments and documentation to your scripts.

 

Note that Python 2.7 has reached end of life*; migrate to 3.x.

 

*) https://www.python.org/doc/sunset-python-2/


17 hours ago, Ghideon said:

Suggestion 1: Add comments and documentation to your scripts.

That would be a good idea; I will add them during my free time.

Quote

Note that Python 2.7 has reached end of life*; migrate to 3.x.

My scripts technically also run on 64-bit Python 3.x, maybe with some changes to the print statements.
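
For example, the print statement syntax is the main thing that changes between 2.7 and 3.x; a minimal sketch (the variable names here are just for illustration, not taken from the actual scripts):

# Python 2 only:
# print "phrase:", phrase, "count:", count

# Works on both 2.7 and 3.x:
from __future__ import print_function

phrase, count = "the quick brown", 42
print("phrase:", phrase, "count:", count)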


15 minutes ago, fredreload said:

This Python script is too slow. I think I need to rewrite the program in C#; do you think that would give it a performance boost, Ghideon?

The information provided is too limited, so I will not try to make a prediction. In my experience, performance is the result of many parameters; using another implementation language may or may not result in the required performance boost.


2 hours ago, Ghideon said:

The information provided is too limited, so I will not try to make a prediction. In my experience, performance is the result of many parameters; using another implementation language may or may not result in the required performance boost.

I agree. I am unable to find a suitable disk-based dictionary for C# = =; I am using shelve for this program. So I might have to use SQLite if I want to integrate it into a C# platform, but SQLite is still pretty slow.


21 hours ago, fredreload said:

I agree. I am unable to find a suitable disk-based dictionary for C# = =; I am using shelve for this program. So I might have to use SQLite if I want to integrate it into a C# platform, but SQLite is still pretty slow.

The information provided so far is not enough to comment on what kind of performance issue you are facing; I do not have an opinion on whether specific products are suitable or not.


20 hours ago, Ghideon said:

The information provided so far is not enough to comment on what kind of performance issue you are facing; I do not have an opinion on whether specific products are suitable or not.

As you can see, my shelve db file is a few GB in size. Each dictionary entry consists of a phrase and a frequency, and I constantly update this frequency. So for SQLite I would have something like 1 billion upsert statements. I tried running 1 billion inserts, which is fast, but when it comes to updating the frequency it takes forever. That is with "if it exists, update; otherwise insert" logic. I never tried upsert, so I don't know whether it is faster or slower.
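
For reference, the kind of single-statement upsert in question looks something like this in SQLite 3.24+ (a rough sketch; the table and file names are made up, not the actual schema):

import sqlite3

conn = sqlite3.connect("ngrams.db")  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS freq (phrase TEXT PRIMARY KEY, count INTEGER NOT NULL)")

def bump(phrase):
    # Insert the phrase with count 1, or add 1 to the existing row, in one statement.
    conn.execute(
        "INSERT INTO freq (phrase, count) VALUES (?, 1) "
        "ON CONFLICT(phrase) DO UPDATE SET count = count + 1",
        (phrase,),
    )

for p in ["the quick brown", "quick brown fox", "the quick brown"]:
    bump(p)
conn.commit()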


The ongoing flow of information makes memory distinguishers impossible the more we fracture time into Nano- and Femto-seconds . ... . .   . . .Background (echo) vibration is thus the distinguishing motor  . . . . .  .Because memory signals ought to go on a differing time-scale vibration . . . . 


3 minutes ago, Prof Reza Sanaye said:

The ongoing flow of information makes memory distinguishers impossible the more we fracture time into Nano- and Femto-seconds . ... . .   . . .Background (echo) vibration is thus the distinguishing motor  . . . . .  .Because memory signals ought to go on a differing time-scale vibration . . . . 

And how is that related to the discussion in this thread?


2 minutes ago, Prof Reza Sanaye said:

By turning the temporal Script (fractals) all but continual  . . . . .

Your text looks like the output from some algorithm, based on Markov chains(?), that generates random sentences.

4 hours ago, fredreload said:

As you can see, my shelve db file is a few GB in size. Each dictionary entry consists of a phrase and a frequency, and I constantly update this frequency. So for SQLite I would have something like 1 billion upsert statements. I tried running 1 billion inserts, which is fast, but when it comes to updating the frequency it takes forever. That is with "if it exists, update; otherwise insert" logic. I never tried upsert, so I don't know whether it is faster or slower.

This seems to require more time and effort than I am willing to provide at this time.


On 3/3/2021 at 6:05 PM, fredreload said:

As you can see, my shelve db file is a few GB in size.

Do you have an SSD? NVMe? What is the data transfer rate during db access? How many GB of memory does your computer have? Try using a virtual disk in memory (RAM disk) to see whether there is a change in speed.

Quote

Each dictionary entry consists of a phrase and a frequency, and I constantly update this frequency. So for SQLite I would have something like 1 billion upsert statements. I tried running 1 billion inserts, which is fast, but when it comes to updating the frequency it takes forever. That is with "if it exists, update; otherwise insert" logic. I never tried upsert, so I don't know whether it is faster or slower.

How are you storing, querying, and updating the db? Show the SQL query strings for all of them.

You can try:

- Calculate an MD5 (or similar) hash of the phrase text first; it will serve as the hash code.

- Phrase table: use the above hash as a unique key together with each phrase text.

- Frequency table: use the above hash code in a second table with the quantity/frequency as an integer.

The update should then be faster, since it won't require adding or replacing the entire string.

 

Alternatively, don't store phrases as plain text. Have a dictionary of words with unique indices; a 4-byte integer is enough for 4.2 billion words. Then make phrase dictionary tables: one with two columns for word indexes, a second table with three columns for word indexes, etc. In the future you can add more.
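
A rough sketch of the first (hash plus two tables) layout, assuming Python's sqlite3 and hashlib modules; the table, column, and file names are placeholders:

import hashlib
import sqlite3

conn = sqlite3.connect("ngrams.db")  # placeholder file name
conn.executescript("""
CREATE TABLE IF NOT EXISTS phrase (
    hash TEXT PRIMARY KEY,   -- MD5 of the phrase text
    text TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS frequency (
    hash  TEXT PRIMARY KEY,  -- same MD5, joins back to phrase
    count INTEGER NOT NULL
);
""")

def md5_key(phrase):
    # The hash code used as the key in both tables.
    return hashlib.md5(phrase.encode("utf-8")).hexdigest()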


20 hours ago, Sensei said:

Do you have an SSD? NVMe? What is the data transfer rate during db access? How many GB of memory does your computer have? Try using a virtual disk in memory (RAM disk) to see whether there is a change in speed.

How are you storing, querying, and updating the db? Show the SQL query strings for all of them.

You can try:

- Calculate an MD5 (or similar) hash of the phrase text first; it will serve as the hash code.

- Phrase table: use the above hash as a unique key together with each phrase text.

- Frequency table: use the above hash code in a second table with the quantity/frequency as an integer.

The update should then be faster, since it won't require adding or replacing the entire string.

 

Alternatively, don't store phrases as plain text. Have a dictionary of words with unique indices; a 4-byte integer is enough for 4.2 billion words. Then make phrase dictionary tables: one with two columns for word indexes, a second table with three columns for word indexes, etc. In the future you can add more.

That is a really good idea. First I would index all the words present with a list of unique IDs. But the problem I am facing is that SQL is not a dictionary.

Compared to "select * from db where string='123'", a dictionary lookup like db["123"] has a much faster run time; I wonder if there is a way to combine this aspect of a dictionary with the SQL database.
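
What I mean is roughly this comparison (made-up table, same lookup both ways); as far as I understand, a primary key gives SQLite an index, so the SELECT is a keyed lookup rather than a full scan:

import sqlite3

counts = {"123": 7}                      # dictionary: hashed, O(1) average lookup
print(counts["123"])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE db (string TEXT PRIMARY KEY, freq INTEGER)")
conn.execute("INSERT INTO db VALUES ('123', 7)")
# The PRIMARY KEY is backed by an index, so this is a B-tree lookup, not a scan.
print(conn.execute("SELECT freq FROM db WHERE string = ?", ("123",)).fetchone()[0])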


24 minutes ago, fredreload said:

That is a really good idea. First I would index all the words present with a list of unique IDs. But the problem I am facing is that SQL is not a dictionary.

Compared to "select * from db where string='123'", a dictionary lookup like db["123"] has a much faster run time; I wonder if there is a way to combine this aspect of a dictionary with the SQL database.

Make a cache in memory. Check whether the word is present in a dynamically allocated array or key-value (associative) array; if it is, increase the entry's usage counter. If it is not present, look it up in the database and put the new entry in the cache. Keep 1,000 or 10,000 or so of the most used entries, and from time to time flush the least used entries from the cache.

The most frequently used words/entries/phrases will then be cached at all times during execution of the script.

You can make separate caches for single words, two-word phrases, and three-word phrases, each with a user-configurable maximum number of entries.

In an OOP language you would just make a cache class that wraps the entire database code.
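
A minimal sketch of such a cache class, assuming a dict-like backing store such as a shelve object (the names, sizes, and eviction policy are arbitrary choices, not a definitive implementation):

class FrequencyCache(object):
    # Write-back cache in front of a slow key-value store (e.g. a shelve object).
    def __init__(self, store, max_entries=10000):
        self.store = store            # anything dict-like: shelve, or a DB wrapper
        self.max_entries = max_entries
        self.counts = {}              # phrase -> cached frequency
        self.usage = {}               # phrase -> how often this entry was touched

    def increment(self, phrase):
        if phrase not in self.counts:
            if len(self.counts) >= self.max_entries:
                self.flush_least_used()
            self.counts[phrase] = self.store.get(phrase, 0)
        self.counts[phrase] += 1
        self.usage[phrase] = self.usage.get(phrase, 0) + 1

    def flush_least_used(self, fraction=0.5):
        # Write back and drop the least-used part of the cache.
        victims = sorted(self.counts, key=lambda p: self.usage.get(p, 0))
        for phrase in victims[:max(1, int(len(victims) * fraction))]:
            self.store[phrase] = self.counts.pop(phrase)
            self.usage.pop(phrase, None)

    def close(self):
        # Write everything still cached back to the store.
        for phrase, count in self.counts.items():
            self.store[phrase] = count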


1 hour ago, Sensei said:

Make a cache in memory. Check whether the word is present in a dynamically allocated array or key-value (associative) array; if it is, increase the entry's usage counter. If it is not present, look it up in the database and put the new entry in the cache. Keep 1,000 or 10,000 or so of the most used entries, and from time to time flush the least used entries from the cache.

The most frequently used words/entries/phrases will then be cached at all times during execution of the script.

You can make separate caches for single words, two-word phrases, and three-word phrases, each with a user-configurable maximum number of entries.

In an OOP language you would just make a cache class that wraps the entire database code.

When I cached everything in memory it used up all 16 GB of RAM. That is actually the first thing I tried, but my computer cannot handle that big a cache (I only have 16 GB of RAM), so I switched to a disk-based dictionary. The idea behind the dictionary is that every single dict key is hashed to its own slot, so the lookup time for any particular key is essentially O(1). I don't know about a database, though, or whether it could be converted to a dictionary-based method to optimize the run time.


Caching thousands of the most used words in ASCII would take a few dozen kilobytes. Not MB. Not GB. KB. In Unicode, 2-4x more. You are caching so that you don't have to look up things like "I", "you", "it", etc.
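
Rough arithmetic, with an assumed average word length of eight ASCII characters (purely illustrative numbers):

words = 3000       # "thousands of the most used words"
avg_chars = 8      # assumed average word length, illustrative
print(words * avg_chars / 1024.0, "KB of raw text")   # ~23 KB; Unicode would be 2-4x more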


19 hours ago, Sensei said:

Caching thousands of the most used words in ASCII would take a few dozen kilobytes. Not MB. Not GB. KB. In Unicode, 2-4x more. You are caching so that you don't have to look up things like "I", "you", "it", etc.

Ya, but I am not caching words; I am caching phrases in groups of 3.
