referencing a subhash for generalized ngram counting

B

braver

Greetings: I wonder how does one uses single-name variables to refer
to nested sunhashes (subdictionaries). Here's an example:

In [41]: orig = { 'abra':{'foo':7, 'bar':9}, 'ca':{}, 'dabra':{'baz':
4} }

In [42]: orig
Out[42]: {'abra': {'bar': 9, 'foo': 7}, 'ca': {}, 'dabra': {'baz': 4}}

In [43]: h = orig['ca']

In [44]: h = { 'adanac':69 }

In [45]: h
Out[45]: {'adanac': 69}

In [46]: orig
Out[46]: {'abra': {'bar': 9, 'foo': 7}, 'ca': {}, 'dabra': {'baz': 4}}

I want to change orig['ca'], which is determined somewhere else in a
program's logic, where subhashes are referred to as h -- e.g., for x
in orig: ... . But assigning to h doesn't change orig.

The real-life motivation for this is n-gram counting. Say you want to
maintain a hash for bigrams. For each two subsequent words a, b in a
text, you do

bigram_count[a] += 1

-- notice you do want to have nested subhashes as it decreases memory
usage dramatically.

In order to generalize this to N-grammity, you want to do something
like,

h = bigram_count
# iterating over i, not word, to notice the last i
for i in range(len(ngram):
word = ngram
if word not in h:
if i < N:
h[word] = {}
else:
h[word] = 0
h = h[word]
h += 1

-- doesn't work and is just a sketch; also, if at any level we get an
empty subhash, we can short-circuit vivify all remaining levels and
add 1 in the lowest, count, level.

Yet since names are not exactly references, something else is needed
for generalized ngram multi-level counting hash -- what?

Cheers,
Alexy
 
M

Marc 'BlackJack' Rintsch

Greetings: I wonder how does one uses single-name variables to refer
to nested sunhashes (subdictionaries). Here's an example:

That's possible and you do it in your example.
In [41]: orig = { 'abra':{'foo':7, 'bar':9}, 'ca':{}, 'dabra':{'baz':
4} }

In [42]: orig
Out[42]: {'abra': {'bar': 9, 'foo': 7}, 'ca': {}, 'dabra': {'baz': 4}}

In [43]: h = orig['ca']

Here you are getting the reference to the empty dictionary and bind it to
the name `h`.
In [44]: h = { 'adanac':69 }

In [45]: h
Out[45]: {'adanac': 69}

In [46]: orig
Out[46]: {'abra': {'bar': 9, 'foo': 7}, 'ca': {}, 'dabra': {'baz': 4}}

I want to change orig['ca'], which is determined somewhere else in a
program's logic, where subhashes are referred to as h -- e.g., for x
in orig: ... . But assigning to h doesn't change orig.

Correct. Changing the dictionary bound to `h` changes it::

h = orig['ca']
h['adanac'] = 69
Yet since names are not exactly references, something else is needed
for generalized ngram multi-level counting hash -- what?

Names *are* implemented as references to objects, but binding the name to
a different object has no effect on the object bound to that name before.

Ciao,
Marc 'BlackJack' Rintsch
 
B

braver

Here's a working version of the ngram counter with nested dict, wonder
how it can be improved!

lines = ["abra ca dabra",
"abra ca shvabra",
"abra movich roman",
"abra ca dabra",
"a bra cadadra"]

ngrams = [x.split() for x in lines]

N = 3
N1 = N-1

orig = {}

for ngram in ngrams:
h = orig
# iterating over i, not word, to notice the last i
for i in range(N):
word = ngram
if word not in h:
if i < N1: # (*)
h[word] = {}
else:
h[word] = 0
if i < N1:
h = h[word]
print i, h

h[word] += 1

print orig


-- e.g., perhaps we could do short-circuit vivification to the end in
(*)?

Cheers,
Alexy
 
S

Scott David Daniels

braver said:
...
The real-life motivation for this is n-gram counting. Say you want to
maintain a hash for bigrams. For each two subsequent words a, b in a
text, you do
bigram_count[a] += 1


This application is easily handed with tuples as keys.

bigrams = {}
src = iter(source)
lag = src.next()
for current in src:
bigrams[lag, current] = bigrams.get((lag, current), 0) + 1
lag = current

But if you really want nested:

bigrams = {}
src = iter(source)
lag = src.next()
for current in src:
count = bigrams.setdefault(lag, {}).get(current, 0)
bigrams[lag][current] = count + 1
lag = current

-Scott David Daniels
(e-mail address removed)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,982
Messages
2,570,186
Members
46,739
Latest member
Clint8040

Latest Threads

Top