referencing a subhash for generalized ngram counting

braver · Nov 13, 2007

Greetings: I wonder how does one uses single-name variables to refer
to nested sunhashes (subdictionaries). Here's an example:

In [41]: orig = { 'abra':{'foo':7, 'bar':9}, 'ca':{}, 'dabra':{'baz':
4} }

In [42]: orig
Out[42]: {'abra': {'bar': 9, 'foo': 7}, 'ca': {}, 'dabra': {'baz': 4}}

In [43]: h = orig['ca']

In [44]: h = { 'adanac':69 }

In [45]: h
Out[45]: {'adanac': 69}

In [46]: orig
Out[46]: {'abra': {'bar': 9, 'foo': 7}, 'ca': {}, 'dabra': {'baz': 4}}

I want to change orig['ca'], which is determined somewhere else in a
program's logic, where subhashes are referred to as h -- e.g., for x
in orig: ... . But assigning to h doesn't change orig.

The real-life motivation for this is n-gram counting. Say you want to
maintain a hash for bigrams. For each two subsequent words a, b in a
text, you do

bigram_count[a] += 1

-- notice you do want to have nested subhashes as it decreases memory
usage dramatically.

In order to generalize this to N-grammity, you want to do something
like,

h = bigram_count
# iterating over i, not word, to notice the last i
for i in range(len(ngram):
word = ngram
if word not in h:
if i < N:
h[word] = {}
else:
h[word] = 0
h = h[word]
h += 1

-- doesn't work and is just a sketch; also, if at any level we get an
empty subhash, we can short-circuit vivify all remaining levels and
add 1 in the lowest, count, level.

Yet since names are not exactly references, something else is needed
for generalized ngram multi-level counting hash -- what?

Cheers,
Alexy

Marc 'BlackJack' Rintsch · Nov 13, 2007

Greetings: I wonder how does one uses single-name variables to refer
to nested sunhashes (subdictionaries). Here's an example:

That's possible and you do it in your example.

In [41]: orig = { 'abra':{'foo':7, 'bar':9}, 'ca':{}, 'dabra':{'baz':
4} }

In [42]: orig
Out[42]: {'abra': {'bar': 9, 'foo': 7}, 'ca': {}, 'dabra': {'baz': 4}}

In [43]: h = orig['ca']

Here you are getting the reference to the empty dictionary and bind it to
the name `h`.

In [44]: h = { 'adanac':69 }

In [45]: h
Out[45]: {'adanac': 69}

In [46]: orig
Out[46]: {'abra': {'bar': 9, 'foo': 7}, 'ca': {}, 'dabra': {'baz': 4}}

I want to change orig['ca'], which is determined somewhere else in a
program's logic, where subhashes are referred to as h -- e.g., for x
in orig: ... . But assigning to h doesn't change orig.

Correct. Changing the dictionary bound to `h` changes it::

h = orig['ca']
h['adanac'] = 69

Yet since names are not exactly references, something else is needed
for generalized ngram multi-level counting hash -- what?

Names *are* implemented as references to objects, but binding the name to
a different object has no effect on the object bound to that name before.

Ciao,
Marc 'BlackJack' Rintsch

braver · Nov 13, 2007

Here's a working version of the ngram counter with nested dict, wonder
how it can be improved!

lines = ["abra ca dabra",
"abra ca shvabra",
"abra movich roman",
"abra ca dabra",
"a bra cadadra"]

ngrams = [x.split() for x in lines]

N = 3
N1 = N-1

orig = {}

for ngram in ngrams:
h = orig
# iterating over i, not word, to notice the last i
for i in range(N):
word = ngram
if word not in h:
if i < N1: # (*)
h[word] = {}
else:
h[word] = 0
if i < N1:
h = h[word]
print i, h

h[word] += 1

print orig

-- e.g., perhaps we could do short-circuit vivification to the end in
(*)?

Cheers,
Alexy

Scott David Daniels · Nov 14, 2007

braver said:
...
The real-life motivation for this is n-gram counting. Say you want to
maintain a hash for bigrams. For each two subsequent words a, b in a
text, you do
bigram_count[a] += 1

This application is easily handed with tuples as keys.

bigrams = {}
src = iter(source)
lag = src.next()
for current in src:
bigrams[lag, current] = bigrams.get((lag, current), 0) + 1
lag = current

But if you really want nested:

bigrams = {}
src = iter(source)
lag = src.next()
for current in src:
count = bigrams.setdefault(lag, {}).get(current, 0)
bigrams[lag][current] = count + 1
lag = current

-Scott David Daniels
(e-mail address removed)

Padding strings for a clean visual print out...	5	Dec 23, 2023
Minimum Total Difficulty	0	Nov 15, 2023
Trouble with prediction code, for the life of me I can't figure out why it isnt running properly. Help would be appreciated.	0	Jul 8, 2023
[SUMMARY] Counting Cards (#152)	3	Jan 17, 2008
print header for output	0	Jun 19, 2011
ISO dict => xml converter	3	Jun 20, 2008
MemoryError in data conversion	10	Apr 14, 2014
large array in a single line	5	May 26, 2009

referencing a subhash for generalized ngram counting

braver

Marc 'BlackJack' Rintsch

braver

Scott David Daniels

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads